[57] ABSTRACT 

A method and an apparatus for providing scalable layers of 
highly available applications using loosely coupled com- 
mercially available computers. The software running on the 
loosely coupled computers is divided into three layers: the 
system layer, the platform layer, and the application layer, 
each having its own process group activation and fault 
recovery strategy. A process group contains software pro- 
cesses that depend upon a set of resources common to the 
process group. In addition to depending upon a common set 
of resources, processes within a process group share a fault 
recovery strategy. Fault recovery is performed at the process 
group level, such that if one process within a process group 
fails, fault recovery takes place for all processes within the 
process group. In the preferred embodiment, an application 
layer process group may be paired with another application 
layer process group on a separate computer. As part of 
certain escalated process group fault recovery strategies, 
upon taking an application layer process group out of 
service, its paired application layer process group, if any 
exists, takes over performing the functions of the process 
group that was taken out of service. 

88 Claims, 2 Drawing Sheets 
METHOD AND APPARATUS FOR 
PROVIDING SCALEABLE LEVELS OF 
APPLICATION AVAILABILITY 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to computer system 
architectures, and more particularly to a reliable cluster 
computing architecture that provides scalable levels of high 
availability applications simultaneously across commer- 
cially available computing elements. 

2. Statement of Related Art 

Prior art high availability clustered computer systems are 
typically configured in an architecture having shared physi- 
cal storage devices, such as a shared disk. Therefore, prior 
art cluster offerings are typically based on physical 
hardware, or clustered arrangements of systems and storage, 
particularly adapted to a unique application processing envi- 
ronment. In a common type of prior art high availability 
cluster, all of the critical application data must reside on an 
external shared disk, or on a pool of disks, that is accessible 
from more than one computing system in the cluster. Such a 
prior art cluster tries to isolate access to the data partitions 
on the disk so that access to the shared disk is limited to only 
one computing system at a time. Upon failure of the primary 
computing system, a takeover occurs whereby the high 
availability cluster reallocates access to the disk from the 
primary computing system to the dedicated backup system. 
Once such a reallocation is performed, the applications on 
that backup system will have access to the disk. 

Another prior art high availability cluster solution is a 
multi-processor cluster. Like the shared-disk cluster, the 
multi-processor cluster is a hardware-based cluster
arrangement of computing systems. Unlike the shared-disk cluster, 
in which the computing systems are essentially unrelated to 
each other, the computing systems in a multi-processor 
cluster are all running the same application and using the 
same data at virtually the same time. All physical storage is 
configured to be accessible to all computing systems. Such 
multi-processor clusters, in an attempt to control access to 
concurrent data, typically use lock management software to 
manage access to data and prevent any data corruption or 
integrity problems. The loss of a computing system from a 
multi-processor cluster allows the remaining systems to 
continue processing the data. 

Another prior art high availability cluster solution is a 
symmetrical multi-processing, or scalable parallel 
processing, cluster based on a shared memory or system bus 
architecture where the memory is common to multiple 
computing systems. Such systems, in an attempt to improve 
performance by scaling the number of computing systems in 
the symmetrical multi-processing cluster, allow a single 
computing system failure to cause the entire symmetrical 
multi-processing or scalable parallel processing cluster plat- 
form to become unavailable. 

Yet another high availability cluster architecture is a 
multiple parallel processor cluster, in which each computing 
system has its own memory and disk, none of which are 
shared with any other computing system in the cluster. If one 
system has data on a disk, and that data is required by 
another computing system, the first computer sends the data 
over a high speed network to the other computing system. 
Such multiple parallel processor clusters, in an attempt to 
improve performance by allowing multiple computing sys- 
tems to work concurrently, allow data associated with a 
failed computing system to become unavailable. 
The prior art high availability clusters, in trying to provide 
different levels of availability, have used operating system- 
based clusters to optimize the unique data and application 
characteristics for a specific targeted commercial market. 

Such a targeted approach does not lend itself well to certain 
industries, including telecommunications, in which numer- 
ous legacy applications currently exist, each with unique 
recovery and performance characteristics running on pro- 
prietary hardware, some of which is fault tolerant. 

Therefore, a computing system architecture that provides 
varying levels of high availability applications simulta- 
neously across one or more loosely coupled commercially 
available computing elements using a commercially avail- 
able interconnect is desirable. 

The prior art high availability cluster solutions have the 
capability to support "heartbeats" and recovery of a specified 
application. The most significant architectural difference 
between the prior art solutions is the method for determining 
how an application and/or computing system is chosen or 
controlled to be active or standby and the method for 
determining when they will be allowed access to the 
application data. Typical physical high availability cluster 
solutions determine the status of the configuration via a set 
of redundant communication facilities between the pair of 
computing systems. Under most circumstances, the paired 
systems are able to determine which system is active for an 
application. 

In prior art high availability solutions, when all 
communication is lost between computing systems, the computing 
systems or clustered applications might each take on an 
active role believing that the other has failed. Such a 
situation presents an undesirably high risk of application 
data and processing being corrupted. Several added levels of 
protection and safety are possible to prevent that from 
happening. Some solutions in the prior art nearly eliminate 
this risk using heartbeats through the shared storage. Since 
certain cluster solutions do not need to use shared storage, 
a platform neutral hardware component is desirable to 
complement the software-based cluster components. It is 
therefore an object of this invention to provide scaleable 
layers of highly available application processes using 
loosely coupled commercially available computing elements. 

SUMMARY OF THE INVENTION 

This invention provides a method and an apparatus for 
providing scaleable layers of high availability applications 
using loosely coupled commercially available computing 
elements, also referred to as computers. The term computing 
elements refers to any type of processor or any device 
containing such a processor. 

Resource dependencies and fault recovery strategies 
occur at the process group level. For example, a process 
group containing three processes might depend upon four 
resources, such as other process groups or peripheral 
devices, such as a disk. Upon failure of a single process 
within the process group or upon failure of a single resource 
depended upon by the process group, fault recovery will be 
initiated for the entire process group, as a single unit. 
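
The group-level recovery rule described above can be illustrated with a minimal sketch. It is not the patented implementation; the class name `ProcessGroup`, its fields, and the example process and resource names are hypothetical and chosen only for illustration. The sketch assumes that "fault recovery" simply means marking every process in the group for recovery.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProcessGroup:
    """Hypothetical model of a process group (names are illustrative only)."""
    name: str
    processes: set[str]
    resources: set[str] = field(default_factory=set)
    recovered: list[str] = field(default_factory=list)

    def on_fault(self, failed: str) -> bool:
        """Handle a fault in a member process or in a depended-upon resource.

        Returns False when the fault does not concern this group.
        """
        if failed not in self.processes and failed not in self.resources:
            return False
        # Recovery is applied to the whole group as a single unit,
        # not only to the process or resource that failed.
        self.recovered = sorted(self.processes)
        return True


# Three processes depending on four resources, as in the example above.
group = ProcessGroup(
    name="example_group",
    processes={"p1", "p2", "p3"},
    resources={"disk0", "link0", "group_a", "group_b"},
)
group.on_fault("disk0")
print(group.recovered)  # ['p1', 'p2', 'p3']
```

A single resource fault therefore selects all three processes for recovery, which is the behavior the paragraph above describes.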

Process groups can belong to one of three layers: the 
system layer, the platform layer, or the application layer. In 
the preferred embodiment, each layer has a unique process 
group activation and fault recovery strategy. In the preferred 
embodiment, an application layer process group may be 
paired with another application layer process group on a 
separate computer. As part of certain escalated process group 
fault recovery strategies, upon taking an application layer 
process group out of service, its paired application layer 
process group, if any exists, takes over performing the 
functions of the process group that was taken out of service. 

Application layer process groups depend upon one or 
more platform layer process groups, which depend upon one 
or more system layer process groups, which depend upon the 
hardware of the loosely coupled computer hosting the pro- 
cess groups. 

Upon a system layer process group failure, all process 
groups on the host computer are taken out of service, which 
includes activating on another computer or computers any 
application layer process group that is paired with an appli- 
cation layer process group taken out of service, the computer 
hosting the failed system layer process group is re-booted, 
and all system layer, platform layer, and application layer 
process groups are re-initialized. 

Upon a platform layer process group failure, the platform 
layer process group may be re-started zero or more times. If 
re-starting the failed platform layer process group does not 
cure the platform layer process group failure or if the 
platform layer process group is not restartable, all applica- 
tion layer and platform layer process groups on the host 
computer are taken out of service and re-initialized, which 
includes activating on another computer or computers any 
application layer process group that is paired with an appli- 
cation layer process group taken out of service on the host 
computer. 

Upon failure of a resource depended upon by a platform 
layer process group, all application layer and platform layer 
process groups on the host computer are taken out of service 
and re-initialized, which includes activating on another 
computer or computers any application layer process group 
that is paired with an application layer process group taken 
out of service on the host computer. 

Upon failure of an application layer process group, the 
failed application layer process group may be restarted zero 
or more times. If restarting the failed application layer 
process group does not correct the application layer process 
group failure or if the failed application layer process group 
is not restartable, then the failed application layer process 
group is taken out of service, which includes activating on 
another computer the application layer process group, if any, 
that is paired with the application layer process group taken 
out of service. 

Upon failure of a resource depended upon by an appli- 
cation layer process group, the dependent application layer 
process group is taken out of service, which includes acti- 
vating on another computer the application layer process 
group, if any, that is paired with the dependent application 
layer process group taken out of service on the host com- 
puter. 
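
The layer-dependent scope of escalated recovery summarized above can be written down as a small lookup. This is only a sketch under stated assumptions: the names `Layer`, `RecoveryScope`, and `escalated_scope` are hypothetical, and the function describes the scope reached once any permitted re-starts of the failed group have been exhausted.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    SYSTEM = "system"
    PLATFORM = "platform"
    APPLICATION = "application"


@dataclass(frozen=True)
class RecoveryScope:
    layers: frozenset[Layer]   # layers whose groups are taken out of service
    only_failed_group: bool    # True if just the failed group is affected
    reboot_host: bool          # True if the host computer is re-booted


def escalated_scope(failed: Layer) -> RecoveryScope:
    """Scope of escalated recovery for a failure in the given layer."""
    if failed is Layer.SYSTEM:
        # All process groups on the host are taken out of service and
        # the host is re-booted.
        return RecoveryScope(frozenset(Layer), False, True)
    if failed is Layer.PLATFORM:
        # All platform and application layer groups on the host are
        # taken out of service and re-initialized.
        return RecoveryScope(
            frozenset({Layer.PLATFORM, Layer.APPLICATION}), False, False
        )
    # Application layer: only the failed group is taken out of service.
    return RecoveryScope(frozenset({Layer.APPLICATION}), True, False)


print(escalated_scope(Layer.PLATFORM).reboot_host)  # False
```

In every case, taking a paired application layer process group out of service also activates its partner on another computer; that step is omitted here for brevity.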

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram illustration of the subject 
invention, including four loosely coupled computers, an 
independent computer, and a maintenance terminal. 

FIG. 2 is a state diagram illustrating the sequence of states 
that a process group in the subject invention can transition 
through. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

FIG. 1 shows the preferred embodiment of the subject 
invention, including a maintenance terminal (MT) 2, an 
independent computer (IC) 4, and four industry standard 
commercially available computing elements 6 (also referred to 
as computers 6) loosely coupled together through an 
interconnect 8, such as a network, a computer bus architecture, 
and the like. Computing elements 6 could be any type of 
processor or any device containing such a processor. 

Each computer 6 is running process group management 
software 10, also referred to collectively as the process 
group manager. The process group manager 10 activates 
process groups and initiates fault recovery strategies at the 
process group level. A process group is a group of processes, 
which are typically implemented in software, that are related 
to each other in some way such that it is desirable to manage 
the process group as a single unit. It may be desirable to 
restart all of the processes within a process group together, 
or it may be desirable, as part of an escalated fault recovery 
strategy, to have the functionality of all of the processes in 
the process group performed by a process group on a 
separate computer. The process group might, but does not 
have to, depend upon a resource, or a set of resources 
common to the process group. 

The independent computer 4 is preferably a computing 
device designed to have a minimal number of faults over an 
extended period of time. The independent computer 4 
monitors computers 6 for hardware faults using heartbeats, as 
disclosed in commonly assigned U.S. Pat. No. 5,560,033. 
Each of the loosely coupled computers 6 is coupled to the 
independent computer 4 as shown with reference numeral 12 
in FIG. 1. 

Process groups may belong to one of three layers: the 
system layer 13, the platform layer 15, and the application 
layer 17. Each computer 6 is shown running process group 
management (PGM) software 10, one system layer process 
group (SLPG) 14, one platform layer process group (PLPG) 
16, either one or two application layer process groups 
(ALPG) 18 that are not part of a process-group pair (PGP) 24, 
one primary application layer process group (P-ALPG) 20 
that is part of a process-group pair 24, and one alternate 
application layer process group (A-ALPG) 22 that is part of 
a process-group pair 24. By definition: (1) each 
process-group pair 24 contains one primary process group 20 on a 
first computer 6 and one alternate process group 22 on a 
second computer 6; (2) each primary process group 20, and 
each alternate process group 22, is part of a process-group 
pair; and (3) process groups 14, 16, and 18 are not part of a 
process-group pair 24. In the preferred embodiment, 
process-group pairs may belong only to the application layer 
17; however, process-group pairs could be provided at the 
system layer 13, the platform layer 15, or the application 
layer 17, or at all of these layers, without departing from the 
scope of this invention. 

Although each computer 6 in the preferred embodiment is 
running the number of certain types of process groups 
mentioned above, each computer 6 could, without departing 
from the scope of this invention, be running: (1) zero or 
more process groups 14, 16, and 18; (2) zero or more 
primary process groups 20; and (3) zero or more alternate 
process groups 22. 

Similarly, the number of computers 6 can be two or more 
without departing from the scope of this invention, and, 
although the process group management software 10 is 
shown running on all four of the loosely coupled computers 

65 6, it could be running on any permutation or combination of 
the loosely coupled computers 6 and/or the independent 
computer 4 without departing from the scope of this inven- 
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tion. Further, the functions performed by the independent words, a human operator must enter a command from the 

computer 4 could be performed using amy type of device local maintenance terminal 2 to put one or more process 

capable of running any type of processor, such as a periph- groups 14, 16, 18, 20, and 22 into the Off-line state 38 or to 

eral board with a digital signal processor, a system board remove one or more process groups 14, 16, 18, 20, and 22 

with a general purpose processor, a fault tolerant processor 5 from the Off-line state 38. The Off-line state 38 may be 

and the like, without departing from the scope of this entered under circumstances other than by manual 

invention. operation, such as upon a command from the process group 

The independent computer 4 uses industry-standard inter- manager 10, without departing from the scope of this 

faces 12, such as RS-232 in the preferred embodiment, invention. 

although any interface could be used. This provides the 1Q Assuming there are no resource or process group faults 

capability of leveraging general networking interfaces, such and neither the primary nor the alternate process groups 20 

as Ethernet, Fiber Distributed Data Interface (FDDI), Asyn- and 22 have been manually transitioned to the Off-line state 

chronous Transfer Mode (ATM), and protocols, such as 38, process-group pairs 24 contain a primary process group 

Transmission Control Protocol/Internet Protocol (TCP/IP), 20 and an alternate process group 22 in an Active 

or Peripheral Component Interconnect (PCI). l5 36/Standby 34 paired relationships: active/cold-standby or 

Each loosely coupled computer 6 typically contains a active/hot-standby. In the preferred embodiment, the pri- 

uni-processor, multi-processor, or fault tolerant processor mary process group 20 is initialized to the Active state 36 

system having an operating environment that is the same as, and the alternate process group 22 is initialized to the 

or different from, the operating environment of each of the Standby state 34. 

other loosely coupled computers 6. In other words, separate 20 Although primary process groups 20 are initialized to 

computers 6 can run different operating systems, for Active 36 and alternate process groups 22 are initialized to 

instance, WINDOWS as opposed to UNIX, different oper- Standby 34, under certain conditions, an alternate process 

ating environments, for instance, real-time as opposed to group 22 can be Active 36 while its paired primary process 

non-real-time, and have different numbers and types of group 20 is Standby 34. For example, if a fault occurs in a 

processors. Each of the loosely coupled computers 6 can be 2 5 primary process group 20 contained in a process-group pair 

either located at the same site or geographically separated 24 having an Active 36/Standby 34 paired relationship, the 

and connected via a network 8, such as a local area network process group manager 10 will transition that primary pro- 

(LAN) or a wide area network (WAN). cess group 20 from Active 36 to Unavailable 30 and tran- 

As previously mentioned, each process group 14, 16, 18, sition the alternate process group 22 from Standby 34 to 

20, or 22 contains one or more processes that may depend 30 Active 36. Once the fault that caused the primary process 

on a set of resources common to the process group 14, 16, group 20 to be transitioned to Unavailable 30 has been 

18, 20, or 22. For example, a set of such resources could corrected, the primary process group 20 will be transitioned 

include a computer hardware peripheral device, or another to Standby 34 until some event, such as an alternate process 

process group 14, 16, or 18, 20, or 22, a communication link, group fault occurs to cause the alternate process group 22 to 

available disk space, or anything that might affect the 35 transition to Unavailable 30, at which time the primary 

availability of an external application. Each alternate pro- process group 20 will be transitioned from Standby 34 to 

cess group 22 depends upon a set of resources that is Active 36. A process-group pair's primary and alternate 

functionally equivalent to the set of resources depended process groups 20 and 22 can be switched from Active 

upon by the alternate -process-group's paired primary pro- 36/Standby 34 to Standby 34/Active 36 manually or under 

cess group 20. The set of resources depended upon by a 40 circumstances in which switching the Active 36 process 

primary process group 20 and the separate set of resources group 20 or 22 to Standby 34 and the Standby 34 process 

depended upon by its paired alternate process group 22 do group 20 or 22 to Active 36 is desirable, 

not, however, have to contain the same number of resources. The availability of an application layer process group 18, 

Each of the processes contained within a process group 14, 20 or 22 typically depends upon the availability of one or 

16, 18, 20, or 22 also has an activation and fault recovery 45 more platform layer process groups 16. The availability of a 

strategy common to the process group 14, 16, 18, 20, or 22. platform layer process groups 16 typically depends upon the 

FIG. 2 shows the states through which every process availability of one or more system layer process groups 14, 

group 14, 16, 18, 20, and 22 may transition, namely, and the availability of system layer process groups 14 

Unavailable (Unavail) 30, Initialization (Init) 32, Standby depends upon the availability of the hardware of the com- 

34, Active 36, and Off-line 38. Process groups in the 50 puter 6 hosting the system layer process groups 14. System 

Unavailable 30 and Initialization 32 states have not been layer process groups 14 are initialized before platform layer 

started and, therefore, are not running. Process groups in the process groups 16, which are initialized before application 

Active state 36 have been started and are running. Whether layer process groups 18, 20, and 22. 

a process group in the Standby state 34 has been started and This invention provides the flexibility to implement exter- 

is running depends upon whether the process group is a 55 nal applications using various numbers of process groups 

hot-standby or a cold-standby process group. Hot-standby 14, 16, 18, 20, and 22, and/or process-group pairs 24 spread 

process groups in the Standby 34 state have been started and across two or more computers 6. For instance, the four 

are waiting to be activated. Cold standby process groups in process-group pairs 24 shown in FIG. 1 could be part of one 

the Standby 34 state are not started until they are activated. external application, or they could be part of four, or three, 

Activation of cold standby process groups may also involve 60 or two, separate external applications. In addition, two of the 

initializing any uninitialized resources depended upon by the process group pairs 24 shown in FIG. 1 could be part of the 

Cold -standby process group. In a non-fault condition, pri- same external application, yet be hosted by two separate 

mary process groups 20 are initialized to run in the Active pairs of computers 6. Further, one or more process-group 

state 36, and alternate process groups 22 are initialized to the pairs 24 and/or one or more process groups 14, 16, 18, 20, 

Standby state 34. 65 or 22 may depend upon the same resource as one or more 

In the preferred embodiment, the Off-line state 38 can other process-group pairs 24 and/or one or more process 

only be entered and exited by manual operation. In other groups 14, 16, 18, 20, or 22, such that the failure of a single 
resource could cause a plurality of process groups' and/or a 
plurality of process-group pairs' fault recovery strategies to 
be performed. 

Each layer, system 13, platform 15, and application 17, 
has a unique process group activation and fault recovery 
strategy. In the preferred embodiment, the system layer 
contains non-restartable non-relocatable process groups 14. 
The platform layer may contain either, or both, of two types 
of process groups. Platform layer process groups 16 may be 
either: (1) non-restartable and non-relocatable; or (2) 
restartable and non-relocatable. The application layer 17 may 
contain any, or all, of the following three types of process 
groups: (1) non-restartable and non-relocatable process 
groups 18; (2) restartable and non-relocatable process 
groups 18; and (3) restartable and relocatable process groups 
20 and 22. Relocatable refers to relocating performance of 
the functionality of a primary process group 20 or an 
alternate process group 22 from one computer 6 to another 
computer 6, rather than any type of re-location within the 
same computer 6. Primary process groups 20 and alternate 
process groups 22 are relocatable. Process groups 14, 16, 
and 18 are not relocatable. As previously mentioned, in the 
preferred embodiment, a primary process group 20 and an 
alternate process group 22 can belong to only the 
application layer 17. Nevertheless, it will be obvious to those 
having ordinary skill in the art that primary and alternate 
process groups 20 and 22 could belong to the platform 
and/or system layers and that other permutations and 
combinations of layer-based fault recovery and process group 
activation strategies and restartability and relocatability can 
be implemented without departing from the scope of this 
invention. 

Upon a system layer process group failure: (1) all system 
layer, platform layer, and application layer process groups 
14, 16, 18, 20, and 22 on the computer 6 hosting the failed 
system layer process group ("host computer") are taken out 
of service by transitioning them to the Unavailable state 30; 
(2) for each primary and alternate process group 20 and 22 
that is transitioned from Active 36 to Unavailable 30 on the 
host computer, its paired process group 20 or 22 is activated 
on a separate computer 6 by transitioning the paired process 
group 20 or 22 from Standby 34 to Active 36; and (3) the 
computer 6 that is hosting the failed system layer process 
group 14 is re-booted. In the preferred embodiment, 
re-booting can be done a pre-determined number of times 
over a pre-determined time period. If re-booting the host 
computer 6 does not clear the fault, the independent computer 4 
will then power cycle the host computer 6, thereby 
re-booting the host computer 6. In the preferred 
embodiment, power cycling can be done a pre-determined 
number of times over a pre-determined time period. If power 
cycling the host computer does not clear the system 
layer process group failure, the independent computer 4 will 
cut off power to the host computer 6, which will remain in 
a powered down state. 

Upon a platform layer process failure, the failed process 
may be re-started zero or more times. In the preferred 
embodiment, re-startable platform layer processes may be 
re-started a pre-determined number of times over a 
pre-determined time period, for instance three re-starts within 
five minutes. 

Upon failure of re-starting the failed platform layer 
process to cure the failure, the process group 16 containing the 
failed process may be re-started zero or more times. In the 
preferred embodiment, re-startable platform layer process 
groups 16 may be re-started a pre-determined number of 
times over a pre-determined time period. When such a 
process group 16 is restarted, it is restarted in its previous 
running state. 

If re-starting the failed platform layer process group 16 
does not correct the platform layer process failure, fault 
recovery is escalated to: (1) taking all application layer 
process groups out of service, including activating any 
process groups 20 or 22 on separate computers that are 
running Standby 34 and that are paired with an application 
layer process group 20 or 22 taken out of service, if any such 
paired Standby process groups 20 or 22 exist; and (2) 
re-initializing all platform layer process groups 16 on the 
computer hosting the failed platform layer process. If the 
failed platform layer process and the process group in which 
it is contained are both non-restartable, then this escalated 
fault recovery strategy is the initial fault recovery action 
taken. The escalated platform layer process group fault 
recovery procedure is also implemented upon detection of a 
fault in a resource depended upon by a platform layer 
process group 16, with the added step of re-initializing the 
resource for which a fault was detected. If re-initializing all 
platform layer process groups 16 on the computer 6 hosting 
the failed platform layer process or resource ("hosting 
computer") does not cure the failure, hosting computer 6 is 
re-booted, thereby causing all process groups 14, 16, 18, 20, 
and 22 on hosting computer 6 to be re-started. In the 
preferred embodiment, hosting computer 6 can be re-booted 
a pre-determined number of times over a pre-determined 
time period. If re-booting hosting computer 6 does not cure 
the failure, independent computer 4 will then power cycle 
hosting computer 6, thereby re-booting hosting computer 6. 
In the preferred embodiment, hosting computer 6 can be 
power cycled a pre-determined number of times over a 
pre-determined time period. If power cycling hosting 
computer 6 does not clear the platform layer process 
group failure, independent computer 4 will cut off power to 
hosting computer 6, which will remain in a powered down 
state. 

Upon failure of a process in an application layer process 
group 18, 20, or 22, the failed process may be re-started zero 
or more times. In the preferred embodiment, re-startable 
application layer processes may be re-started a 
pre-determined number of times over a pre-determined time 
period. 

Upon failure of re-starting the failed application layer 
process to cure the failure, the process group 18, 20 or 22 
containing the failed process may be re-started zero or more 
times. In the preferred embodiment, re-startable application 
layer process groups 18, 20, and 22 may be re-started a 
pre-determined number of times over a pre-determined time 
period. When such a process group 18, 20 or 22 is restarted, 
it is restarted in its previous running state. 

If the failed application layer process group 18, 20, or 22 
is an application layer process group 18, or a primary or 
alternate process group 20 or 22 running Active 36, and 
re-starting such a failed process group does not correct the 
application layer process failure, fault recovery is escalated 
to: taking such a failed application layer process group out 
of service by transitioning it to the Unavailable state 30, and, 
for such a primary or alternate process group 20 or 22, 
activating its paired standby process group 20 or 22 on a 
separate computer 6. This escalated fault recovery strategy 
is the initial fault recovery strategy for process faults of 
non-restartable processes contained within non-restartable 
application layer process groups 18, 20 and 22. This 
escalated fault recovery strategy is also used upon detection of 
a fault in a resource depended upon by an application layer 
process group 18, 20, or 22 running in the Active state 36, 
with the added step of re-initializing the resource for which 
a fault was detected. 
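
The take-over behavior of a process-group pair can be sketched as a small state model. This is an illustrative sketch only, not the process group manager 10 itself: the names `State` and `ProcessGroupPair` are hypothetical, the enum values merely reuse the reference numerals of FIG. 2, and the intermediate Initialization state 32 is skipped for brevity.

```python
from enum import Enum


class State(Enum):
    # Values reuse the reference numerals of FIG. 2 for readability.
    UNAVAILABLE = 30
    INITIALIZATION = 32
    STANDBY = 34
    ACTIVE = 36
    OFF_LINE = 38


class ProcessGroupPair:
    """Hypothetical model of a primary/alternate process-group pair."""

    def __init__(self) -> None:
        # Non-fault condition: primary Active, alternate Standby.
        self.states = {"primary": State.ACTIVE, "alternate": State.STANDBY}

    def fail(self, member: str) -> None:
        """Take `member` out of service and activate its standby partner."""
        other = "alternate" if member == "primary" else "primary"
        was_active = self.states[member] is State.ACTIVE
        self.states[member] = State.UNAVAILABLE
        if was_active and self.states[other] is State.STANDBY:
            self.states[other] = State.ACTIVE

    def repair(self, member: str) -> None:
        """A corrected group returns as Standby, not as Active."""
        if self.states[member] is State.UNAVAILABLE:
            self.states[member] = State.STANDBY


pair = ProcessGroupPair()
pair.fail("primary")
pair.repair("primary")
print(pair.states["primary"].name, pair.states["alternate"].name)
# STANDBY ACTIVE
```

After the repair the roles remain reversed until the alternate fails or the pair is switched manually, consistent with the Standby 34/Active 36 relationship described earlier.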
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If an application layer process group 18, 20, or 22 cannot 
be taken out of service and transitioned to Unavailable 30, 
fault recovery escalates to re-initializing and re -starting all 
application layer and platform layer process groups 16, 18, 
20, and 22 on the computer 6 hosting the failed application 
layer process group or resource ("computer hosting the 
application layer failure"). In the preferred embodiment, 
such re-initializations and re-starts can be performed a 
pre-determined number of times over a pre-determined 
period on the computer hosting the application layer failure 
6. If re-initializing and re-starting all platform layer and 
application layer process groups 16, 18, 20, and 22 on the 
computer hosting the application layer failure does not cure 
the failure, the computer hosting the application layer failure 
is re-booted, thereby causing all process groups 14, 16, 18, 
20, and 22 on the computer hosting the application layer 
failure to be re-started. In the preferred embodiment, the
computer hosting the application layer failure can be
re-booted a pre-determined number of times over a
pre-determined time period. If re-booting the computer hosting
the application layer failure does not cure the failure, inde- 
pendent computer 4 will then power cycle the computer 
hosting the application layer failure, thereby re-booting the 
computer hosting the application layer failure. In the pre- 
ferred embodiment, the computer hosting the application 
layer failure can be power-cycled a pre-determined number 
of times over a pre-determined time period. If power cycling 
the computer hosting the application layer failure does not 
clear the failure, independent computer 4 will cut off power 
to the computer hosting the application layer failure, which 
will remain in a powered down state. 
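
The escalation sequence described above can be sketched as an ordered list of recovery steps, each tried a bounded number of times. This Python sketch is illustrative only and is not the implementation of the specification: the step names, the `recover` function, and the `attempt` callback are hypothetical, and the pre-determined time period over which retries are counted is omitted for brevity.

```python
# Escalation steps in the order described in the text; names are illustrative.
# In the text, the last two steps are performed by independent computer 4.
LADDER = [
    "take_out_of_service",
    "reinit_platform_and_application_groups",
    "reboot",
    "power_cycle",
    "power_down",
]


def recover(attempt, max_tries):
    """Walk the ladder until a step succeeds.

    `attempt(step)` is a hypothetical callback that returns True when the
    step succeeded. `max_tries[step]` is the pre-determined number of tries
    for that step (default 1). Returns (last step taken, succeeded?).
    """
    for step in LADDER[:-1]:
        for _ in range(max_tries.get(step, 1)):
            if attempt(step):
                return step, True
    # Last resort: the computer is left in a powered down state.
    attempt(LADDER[-1])
    return LADDER[-1], False


# Example: the second re-boot cures the failure.
log = []


def fake_attempt(step):
    log.append(step)
    return step == "reboot" and log.count("reboot") == 2


result = recover(
    fake_attempt,
    {"reinit_platform_and_application_groups": 2, "reboot": 3},
)
print(result, log)
```

Keeping the ladder as data rather than nested conditionals makes the order of escalation explicit and lets each step carry its own retry limit.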

Application layer process groups 18, 20, and 22 may, or 
may not, depend upon application layer process groups 18. 
Therefore, a fault in an application layer process group 18, 
20 or 22 will not affect the availability of any system layer 
process groups 14 or any platform layer process groups 16. 
A fault in an application layer process group 18, 20 or 22 
will also not affect the availability of any application layer 
process groups 18, 20 and 22 that are not dependent upon the 
failed application layer process group 18, 20, or 22. 
However, any application layer process groups 18, 20, and 
22 that are dependent upon an application layer process 
group 18 that is taken out of service will also be taken out
of service.
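
The dependency rule described above can be sketched as a small closure computation over a dependency map. This Python sketch is illustrative only: the function name, the group names, and the assumption that the rule applies transitively (a dependent that is taken out of service in turn takes its own dependents out of service) are not stated in this form in the specification.

```python
def groups_to_take_out(failed, depends_on):
    """Return the failed group plus every group that depends on it,
    directly or indirectly.

    `depends_on` maps a group name to the set of application layer
    process groups it depends upon (hypothetical representation).
    """
    out = {failed}
    changed = True
    while changed:
        changed = False
        for group, deps in depends_on.items():
            if group not in out and deps & out:
                out.add(group)
                changed = True
    return out


# Hypothetical groups: "reports" depends on "billing", which depends on "db".
depends_on = {
    "billing": {"db"},
    "reports": {"billing"},
    "paging": set(),
}
affected = groups_to_take_out("db", depends_on)
print(sorted(affected))
```

Groups with no dependency on the failed group, such as "paging" in this example, are left untouched, which matches the statement that independent application layer process groups remain available.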

If a failed primary or alternate process group 20 or 22 is 
running Standby 34 and re-starting such a failed process 
group does not correct the failure, fault recovery is escalated 
to: taking such a failed Standby 34 primary or alternate 
process group 20 or 22 out of service by transitioning it to 
the Unavailable state 30. Upon inability to take such a failed 
Standby 34 primary or alternate process group 20 or 22 out 
of service, the previously described escalated fault recovery 
strategy that is implemented upon inability to take any 
application layer process group 18, 20, or 22 out of service 
is implemented. 

Upon detection of a fault in a resource depended upon by 
an application layer process group 20 or 22 running in the
Standby state 34 or upon detection of a process failure in a 
non-restartable process group 20 or 22 running in the 
Standby state 34, the application layer process group 20 or 
22 that is dependent upon the failed resource, or the process 
group 20 or 22 for which the failure was detected, is 
transitioned to the Unavailable state 30. Upon inability to
take such a process group 20 or 22 out of service, the 
previously described escalated fault recovery strategy that is 
implemented upon inability to take any application layer 
process group 18, 20, or 22 out of service is implemented.

The process group manager 16 can make the state of
individual process groups 14, 16, 18, 20, and 22 and critical
resources known to either: (1) only those computer systems
6 hosting the process groups 14, 16, 18, 20, and 22 to which
the state information pertains, or (2) all of the loosely
coupled computing systems 6. In addition, such state information
can be made available to external application software.

In the preferred embodiment, transitions between process
group states are controlled by process group management
software 10 running on each of the loosely coupled computers
6. Nevertheless, such transitions could be controlled
by process group management software 10 running on any
permutation and/or combination of the centralized computer
4 and the loosely coupled computers 6 without departing
from the scope of this invention.

System layer process groups 14 can contain either: (1) 
operating system software and services found in normal 
commercially available computing systems; or (2) process 
group management software 10. 

Resources depended upon by platform layer process
groups 16 are initialized before initializing platform layer 
process groups 16. Failure to bring a platform layer resource 
in service results in the same fault recovery strategy as previously
explained for a fault in a resource depended upon by a
platform layer process group 16, which can escalate to the
host computer 6 being re-booted.

During initialization, platform layer process groups 16
and application layer process groups 18, 20, and 22 are
designated runable only when all platform layer resources
and platform layer process groups 16 are designated runable.
Platform layer process groups 16 handshake with the
process group manager 10 to control the startup sequence of
platform layer process groups 16. Similarly, application
layer process groups 18, 20, and 22 handshake with the
process group manager 10 to control the startup sequence of
application layer process groups 18, 20, and 22.

Application layer process groups 18, 20, and 22 can be put
into the Off-line state 38 individually so that maintenance or
software updates can be performed on the Off-line process
groups 18, 20, and 22 without impacting other process
groups on the computer 6 hosting the Off-line process
groups 18, 20, and 22.

Primary and alternate process groups 20 and 22 within the 
same process-group pair 24 may have a shared resource 
dependency. Process-group pairs 24 that have an active 
cold-standby paired relationship typically provide high 
availability. Process-group pairs 24 that have an active 
hot-standby paired relationship typically provide very high
availability. 

It will be obvious to those having ordinary skill in the art
that primary and alternate process groups could be arranged
in a lead-active/active paired relationship, analogous to the
Active 36/Standby 34 relationship described above, without
departing from the scope of this invention. Such lead-active/active
process-group pairs typically provide ultra-high availability.

Application layer resources on a single computing system
are initialized before initializing primary or alternate process
groups 20 or 22. Failure to bring a critical application layer
resource in service results in taking the dependent application
layer process group 18, 20, or 22 out of service by
transitioning it to the Unavailable state 30, and activating the
dependent process group's 20 or 22 paired process group 20
or 22, if one exists, on a separate computer 6.
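
The handling of a critical application layer resource that fails to come into service can be sketched as a short decision function. This Python sketch is illustrative only: the function name, the action tuples, and the `paired_with` mapping are hypothetical and do not appear in the specification.

```python
def on_critical_resource_failure(group, paired_with):
    """Return the recovery actions for `group` when one of its critical
    resources cannot be brought into service.

    `paired_with` maps a primary or alternate process group to its paired
    process group on a separate computer; unpaired groups are absent.
    """
    # The dependent group is always taken out of service.
    actions = [("transition", group, "Unavailable")]
    partner = paired_with.get(group)
    if partner is not None:
        # Only paired groups have a partner to take over their functions.
        actions.append(("activate", partner))
    return actions


pairs = {"primary-A": "alternate-A"}
paired_actions = on_critical_resource_failure("primary-A", pairs)
unpaired_actions = on_critical_resource_failure("standalone-B", pairs)
print(paired_actions)
print(unpaired_actions)
```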

Activated platform layer and application layer process
groups 16, 18, 20, and 22 handshake with the process group
manager 10 to acknowledge activation to Active 36 or
Standby 34.

Process-group pairs 24 can be provided only when platform
layer process groups 16 are operating normally on the
computers 6 hosting both the primary process group 20 and
the alternate process group 22 that are to be contained in the
process-group pair 24.

Upon a primary or alternate process group's 20 or 22 
failure to acknowledge initiation of activation by the process 
group manager 10, the primary or alternate process group 20 
or 22 for which activation was initiated is transitioned to the 
Unavailable state 30. Similarly, upon an application layer 
primary or alternate process group's 20 or 22 failure to 
acknowledge initiation of de-activation from Active 36 to 
Standby 34, the primary or alternate process group 20 or 22 
for which de-activation was initiated is transitioned to the 
Unavailable state 30. 

Although initiated, activation of a standby process group 
20 or 22 does not occur until the Standby process group 20 
or 22 for which activation has been initiated acknowledges 
initiation of the activation by handshaking with the process 
group manager 10. 
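
The acknowledgement rule described above can be sketched as a function that resolves an initiated transition only once the handshake outcome is known. This Python sketch is illustrative only: the enumeration values simply reuse the reference numerals of the states, and the function name and the way a missing acknowledgement is detected (for example, by a timeout) are assumptions.

```python
from enum import Enum


class State(Enum):
    # Values reuse the reference numerals used for the states in the text.
    UNAVAILABLE = 30
    STANDBY = 34
    ACTIVE = 36
    OFFLINE = 38


def complete_transition(current, target, acknowledged):
    """Resolve an activation (Standby to Active) or de-activation
    (Active to Standby) that the process group manager has initiated.

    The target state is reached only if the process group acknowledged
    the initiation; otherwise the group is transitioned to Unavailable.
    """
    allowed = {(State.STANDBY, State.ACTIVE), (State.ACTIVE, State.STANDBY)}
    if (current, target) not in allowed:
        raise ValueError("unsupported transition in this sketch")
    return target if acknowledged else State.UNAVAILABLE


activated = complete_transition(State.STANDBY, State.ACTIVE, acknowledged=True)
timed_out = complete_transition(State.STANDBY, State.ACTIVE, acknowledged=False)
print(activated, timed_out)
```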

Primary and alternate process groups 20 and 22 belonging 
to the same process-group pair 24 can be put in the Off-line 
state 38 for maintenance without impacting other process 
groups 18, 20, and 22, on any of the loosely coupled 
computers 6. 

Separate application layer process groups 18, 20, and 22, 
running on the same computer 6, or on different computers 
6, can host dissimilar external applications, such as one or 
more application layer process groups 18, 20, or 22 con- 
trolling Code Division Multiple Access (CDMA) cellular 
telephone call processing, one or more process groups 18, 
20, or 22 controlling Time Division Multiple Access 
(TDMA) cellular telephone call processing, one or more
process groups 18, 20, or 22 controlling Group Special 
Mobile (GSM) cellular telephone call processing, one or 
more process groups 18, 20, or 22 controlling Cellular 
Digital Packet Data (CDPD) cellular telephone call 
processing, and one or more process groups 18, 20, or 22 
controlling Analog Mobile Phone Service (AMPS) cellular 
telephone call processing. Similarly, separate application 
layer process groups 18, 20, and 22, running on the same 
computer 6, or on different computers 6, can host dissimilar 
external applications. 

As will be obvious to those having ordinary skill in the art, 
this invention provides the flexibility to configure one or 
more process-group pairs 24 across two or more loosely 
coupled computers 6 in several high availability computing 
element 6 configurations including active/standby, active/ 
active, and a typical n+k sparing arrangement. 

We claim: 

1. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer;

e. re-booting said first computer upon a system layer
process group fault occurring on said first computer; 

f. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with one of said one or more 
process groups on said first computer; and 

g. activating said paired process group on said second 
computer upon a system layer process group fault 
occurring on said first computer. 

2. The method of claim 1 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said system layer
process group fault. 

3. The method of claim 2 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said system layer 
process group fault. 

4. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a fault in a resource depended upon 
by at least one of said system layer process groups on 
said first computer; 

e. re-booting said first computer upon said fault in said 
resource depended upon by at least one of said system 
layer process groups on said first computer; 

f. providing at least one paired process group on a second
of said at least two computers, said paired process 
group being paired with one of said process groups on 
said first computer; and 

g. activating said paired process group on said second 
computer upon said fault in said resource depended 
upon by at least one of said system layer process groups 
on said first computer. 

5. The method of claim 4 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said fault in said
resource depended upon by at least one of said system layer 
process groups on said first computer. 

6. The method of claim 5 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by at least one of said system layer 
process groups on said first computer. 

7. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups;

b. running, on at least one of said at least two computers,
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer;

e. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

f. providing, on at least said first computer, a platform 
layer having at least one process group; 

g. taking all of said process groups, except each of said at
least one process group in said system layer, out of 
service on said first computer upon a platform layer 
process group fault occurring on said first computer; 

h. providing at least one paired process group on a second
of said at least two computers, said paired process
group being paired with one of said process groups on 
said first computer; and 

i. activating said paired process group on said second 
computer upon said platform layer process group fault 
occurring on said first computer.

8. The method of claim 7 further comprising, in 
combination, the step of: re-initializing all of said process 
groups, except each of said at least one process group in said 
system layer, on said first computer upon said platform layer 
process group fault occurring on said first computer. 

9. The method of claim 8 further comprising, in 
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said platform layer 
process group fault. 

10. The method of claim 9 further comprising, in
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said platform layer 
process group fault. 

11. The method of claim 10 further comprising, in
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said platform layer 
process group fault. 

12. A method for providing high availability applications
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said
first computer upon a system layer process group fault 
occurring on said first computer; 

e. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

f. providing, on at least a first of said at least two
computers, a platform layer having at least one process
group;

g. taking all of said process groups, except each of said at
least one process group in said system layer, out of 
service on said first computer upon a fault in a resource 
depended upon by at least one of said platform layer 
process groups on said first computer; 

h. providing at least one paired process group on a second 
of said at least two computers, said at least one paired 
process group being paired with one of said process 
groups on said first computer; and 

i. activating said at least one paired process group on said 
second computer upon said fault in said resource 
depended upon by at least one of said platform layer 
process groups on said first computer. 

13. The method of claim 12 further comprising, in 
combination, the steps of: 

a. re-initializing said resource having said fault; and 

b. re-initializing all of said process groups, except each of 
said at least one process group in said system layer, on 
said first computer upon said fault in said resource 
depended upon by at least one of said platform layer 
process groups on said first computer. 

14. The method of claim 13 further comprising, in 
combination, the step of: re-booting said first computer upon 
failure of said re-initializations to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

15. The method of claim 14 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

16. The method of claim 15 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

17. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

e. re-booting said first computer upon a system layer
process group fault occurring on said first computer; 

f. providing, on at least a first of said at least two 
computers, a platform layer having at least one process 
group; 

g. restarting at least one of said platform layer process 
groups upon a platform layer process group fault occur- 
ring on said first computer; 

h. taking all of said process groups, except each of said at 
least one process group in said system layer, out of
service on said first computer upon failure of said
re-start to cure said platform layer process group fault;

i. providing at least one paired process group on a second
of said at least two computers, said at least one paired
process group being paired with one of said process
groups on said first computer; and

j. activating said at least one paired process group on said
second computer upon failure of said re-start to cure
said platform layer process group fault.

18. The method of claim 17 further comprising, in 
combination, the step of: re-initializing all of said process 
groups, except each of said at least one process group in said 
system layer, on said first computer upon failure of said 
re-start to cure said platform layer process group fault. 

19. The method of claim 18 further comprising, in 
combination, the step of: re-booting said first computer upon
failure of said re-initialization to cure said platform layer
process group fault. 

20. The method of claim 19 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said platform layer 
process group fault. 

21. The method of claim 20 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said platform layer 
process group fault. 

22. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least said first computer, a system layer 
having at least one process group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer;

e. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

f. providing, on at least a first of said at least two 
computers, an application layer having at least one 
process group; 

g. taking said at least one application layer process group 
out of service on said first computer upon a fault in said 
at least one application layer process group on said first 
computer; 

h. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with one of said at least one 
application layer process group taken out of service on 
said first computer; 

i. activating said paired process group on said second 
computer upon said fault in said at least one application 
layer process group on said first computer; and 

j. re-initializing all of said process groups, except each of 
said at least one process group in said system layer, on 
said first computer upon not being able to take said 
application layer process group having said fault out of 
service on said first computer.

23. The method of claim 22 further comprising, in
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said application layer 
process group fault. 

24. The method of claim 23 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said application layer process 
group fault. 

25. The method of claim 24 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said application layer 
process group fault. 

26. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, an application layer having at least two 
process groups; 

d. defining a dependency by at least a first of said at least 
two application layer process groups upon at least a 
second of said at least two application layer process 
groups; 

e. taking said first and said second application layer 
process groups out of service on said first computer 
upon a fault in said second application layer process 
group on said first computer; 

f. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with said second application layer 
process group on said first computer; and 

g. activating said paired process group on said second 
computer upon said fault in said second application 
layer process group on said first computer. 

27. The method of claim 26 further comprising, in 
combination, the steps of: 

a. providing, on at least said first computer, a system layer 
having at least one process group; 

b. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

c. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 
and 

d. re-initializing all of said process groups, except each of
said at least one process group in said system layer, on 
said first computer upon not being able to take said first 
application layer process group out of service on said 
first computer. 

28. The method of claim 27 further comprising, in 
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said fault in said
second application layer process group. 

29. The method of claim 28 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot
of said first computer to cure said fault in said second
application layer process group. 

30. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, an application layer having at least one 
process group; 

d. taking said at least one application layer process group 
out of service on said first computer upon a fault in a 
resource depended upon by said at least one application 
layer process group; 

e. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with one of said at least one 
application layer process group on said first computer; 
and 

f. activating said paired process group on said second 
computer upon said fault in said resource depended 
upon by said at least one application layer process 
group. 

31. The method of claim 30 further comprising, in 
combination, the steps of: 

a. providing, on at least said first computer, a system layer 
having at least one process group; 

b. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer;

c. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

d. re-initializing said resource having said fault; and

e. re-initializing all of said process groups, except each of 
said at least one process group in said system layer, on 
said first computer upon not being able to take said at 
least one application layer process group out of service 
on said first computer. 

32. The method of claim 31 further comprising, in 
combination, the step of: re-booting said first computer upon
failure of said re-initializations to cure said fault in said 
resource depended upon by said at least one application 
layer process group. 

33. The method of claim 32 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said resource 
depended upon by said at least one application layer process 
group. 

34. The method of claim 33 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by said at least one application 
layer process group. 

35. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
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one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
5 a process group manager that initiates a fault recovery 

strategy for at least one of said one or more process 
groups; 

c. providing, on at least said first computer, a system layer 
having at least one process group; 

10 d. taking all of said process groups out of service on said 

first computer upon a system layer process group fault 

occurring on said first computer; 
e. re-booting said first computer upon a system layer 

process group fault occurring on said first computer; 
15 f. providing, on at least a first of said at least two 

computers, an application layer having at least one 

process group; 

g. re-starting said at least one application layer process 
group on said first computer upon a fault in said at least 

20 one application layer process group; 

h. taking said at least one application layer process group 
out of service on said first computer upon failure of said 
re-start to cure said fault in said at least one application 
layer process group; 

i. providing at least one paired process group on a second
of said at least two computers, said paired process
group being paired with one of said at least one
application layer process group taken out of service on
said first computer;

j. activating said paired process group on said second
computer upon failure of said re-start to cure said fault
in said at least one application layer process group
taken out of service on said first computer; and

k. re-initializing all of said process groups, except each of
said at least one process group in said system layer, on
said first computer upon not being able to take said
application layer process group having said fault out of
service on said first computer.

36. The method of claim 35 further comprising, in
combination, the step of: re-booting said first computer upon
failure of said re-initialization to cure said application layer
process group fault.

37. The method of claim 36 further comprising, in
combination, the step of: using an independent computer to
power cycle said first computer upon failure of said re-boot
of said first computer to cure said application layer process
group fault.

38. The method of claim 37 further comprising, in
combination, the step of: using an independent computer to
power down said first computer upon failure of said power
cycling of said first computer to cure said application layer
process group fault.

39. The method of claim 38 further comprising, in
combination, the step of: using an independent computer to
power down said first computer upon failure of said power
cycling of said first computer to cure said fault in said
second application layer process group.
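
The escalation recited in claims 35 through 38 can be read as an ordered ladder of recovery actions, each attempted only when the previous one fails. The following Python sketch is illustrative only and is not part of the claims; the names `escalate` and `APPLICATION_LADDER`, and the convention that each action returns True when no further escalation is needed, are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical step names, ordered as the recovery actions of claims 35-38.
APPLICATION_LADDER: List[str] = [
    "re-start process group",                       # claim 35, step g
    "take out of service and activate pair",        # claim 35, steps h and j
    "re-initialize all non-system process groups",  # claim 35, step k
    "re-boot first computer",                       # claim 36
    "power cycle via independent computer",         # claim 37
    "power down via independent computer",          # claim 38
]


def escalate(ladder: List[str], actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Try each step in order; stop at the first action that returns True.

    Returns the names of the steps that were attempted.
    """
    attempted: List[str] = []
    for step in ladder:
        attempted.append(step)
        if actions[step]():
            break
    return attempted


if __name__ == "__main__":
    # Simulated outcome (an assumption): every step fails until the re-boot.
    outcome = {step: False for step in APPLICATION_LADDER}
    outcome["re-boot first computer"] = True
    actions = {step: (lambda ok=ok: ok) for step, ok in outcome.items()}
    for step in escalate(APPLICATION_LADDER, actions):
        print(step)
```

Keeping the ladder as data rather than nested conditionals makes it easy to substitute the shorter ladders of the other claim families, for example a system layer ladder that begins at the re-boot step.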

40. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process
groups, at least one of said process groups containing
one or more processes that have a fault recovery
strategy common to said at least one of said process
groups;

b. running, on at least one of said at least two computers,
a process group manager that initiates a fault recovery
strategy for at least one of said one or more process
groups;

c. providing, on at least a first of said at least two
computers, an application layer having at least two 
process groups; 

d. defining a dependency by at least a first of said at least
two application layer process groups upon at least a
second of said at least two application layer process 
groups; 

e. re-starting said second application layer process group 
on said first computer upon a fault in said second 
application layer process group; 

f. taking said first and said second application layer 
process groups out of service on said first computer 
upon failure of said re-start to cure said fault in said 
second application layer process group; 

g. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with said first application layer 
process group on said first computer; and 

h. activating said at least one paired process group on said 
second computer upon failure of said re-start to cure 
said fault in said second application layer process 
group. 
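
Claim 40 ties recovery to a declared dependency: when the re-start of the second process group fails, the first process group, which depends on it, is taken out of service as well, and the pair of the first group is activated on the second computer. A minimal Python sketch follows; the function names, the example group names, and the choice to follow dependencies transitively are assumptions for illustration and are not recited in the claim.

```python
from typing import Dict, List, Set


def groups_to_remove(faulty: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the faulty group plus every group that depends on it.

    `depends_on[g]` lists the groups that g depends upon. Dependencies are
    followed transitively (an assumption beyond the text of claim 40).
    """
    removed: Set[str] = {faulty}
    changed = True
    while changed:
        changed = False
        for group, deps in depends_on.items():
            if group not in removed and removed.intersection(deps):
                removed.add(group)
                changed = True
    return removed


def pairs_to_activate(removed: Set[str], paired_with: Dict[str, str]) -> List[str]:
    """Return the paired groups (on the second computer) to activate."""
    return sorted(paired_with[g] for g in removed if g in paired_with)


if __name__ == "__main__":
    # Hypothetical groups: "billing" (first group) depends on "database" (second).
    depends_on = {"billing": ["database"]}
    paired_with = {"billing": "billing@computer2"}
    removed = groups_to_remove("database", depends_on)
    print(sorted(removed))
    print(pairs_to_activate(removed, paired_with))
```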

41. The method of claim 40 further comprising, in 
combination, the steps of: 

a. providing, on at least said first computer, a system layer 
having at least one process group; 

b. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer;

c. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 
and 

d. re-initializing all of said process groups, except each of 
said at least one process group in said system layer, on 
said first computer upon not being able to take said first 
application layer process group out of service on said 
first computer. 

42. The method of claim 41 further comprising, in 
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said fault in said
second application layer process group. 

43. The method of claim 42 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said second 
application layer process group. 

44. The method of claim 43 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
second application layer process group. 

45. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least
two computers, a system layer having at least one
process group;

d. means for taking all of said process groups out of
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with one of said one or 
more process groups on said first computer; and 

g. means for activating said paired process group on said 
second computer upon a system layer process group 
fault occurring on said first computer. 

46. The apparatus of claim 45 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said system layer 
process group fault. 

47. The apparatus of claim 46 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said system layer 
process group fault. 

48. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a fault in a resource 
depended upon by at least one of said system layer 
process groups on said first computer; 

e. means for re-booting said first computer upon said fault 
in said resource depended upon by at least one of said 
system layer process groups on said first computer; 

f. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with one of said process 
groups on said first computer; and 

g. means for activating said paired process group on said 
second computer upon said fault in said resource 
depended upon by at least one of said system layer 
process groups on said first computer. 

49. The apparatus of claim 48 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said fault in said 
resource depended upon by at least one of said system layer 
process groups on said first computer. 

50. The apparatus of claim 49 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by at least one of said system layer 
process groups on said first computer. 
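
Claims 48 through 50 trigger recovery from a fault in a resource that a process group depends upon, rather than from a fault in the group itself. One way to picture the lookup from a faulted resource to the affected process groups is sketched below in Python; the mapping `resources_of`, the function name, and the group and resource names are hypothetical.

```python
from typing import Dict, List, Set


def groups_depending_on(resource: str, resources_of: Dict[str, Set[str]]) -> List[str]:
    """Return, in sorted order, every process group that depends on `resource`."""
    return sorted(group for group, needed in resources_of.items() if resource in needed)


if __name__ == "__main__":
    # Hypothetical resource sets, one per process group.
    resources_of = {
        "os_services": {"disk0", "lan0"},
        "database": {"disk1"},
        "call_router": {"lan0", "disk1"},
    }
    print(groups_depending_on("lan0", resources_of))
```

A process group manager built along these lines could then apply the recovery strategy of whichever affected group sits in the lowest layer.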

51. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least said first computer, a 
platform layer having at least one process group;

g. means for taking all of said process groups, except each 
of said at least one process group in said system layer, 
out of service on said first computer upon a platform 
layer process group fault occurring on said first com- 
puter; 

h. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with one of said process 
groups on said first computer; and 

i. means for activating said paired process group on said
second computer upon said platform layer process 
group fault occurring on said first computer. 

52. The apparatus of claim 51 further comprising, in 
combination: means for re-initializing all of said process 
groups, except each of said at least one process group in said 
system layer, on said first computer upon said platform layer 
process group fault occurring on said first computer. 

53. The apparatus of claim 52 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said platform layer 
process group fault. 

54. The apparatus of claim 53 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said platform layer 
process group fault. 

55. The apparatus of claim 54 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said platform layer 
process group fault. 
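
The apparatus claims above differ mainly in how far a fault reaches: a system layer fault takes every process group on the first computer out of service (claim 45, element d), while a platform layer fault spares the system layer (claim 51, element g). An application layer fault, as in claim 35, reaches only the faulty group. The Python sketch below illustrates that scoping rule; the `Layer` enumeration, the function name, and the group names are assumptions made for the example.

```python
from enum import Enum
from typing import Dict, Set


class Layer(Enum):
    SYSTEM = "system"
    PLATFORM = "platform"
    APPLICATION = "application"


def out_of_service_scope(faulty_group: str, layer_of: Dict[str, Layer]) -> Set[str]:
    """Return the process groups taken out of service for a fault in `faulty_group`."""
    layer = layer_of[faulty_group]
    if layer is Layer.SYSTEM:
        # System layer fault: every process group on the computer.
        return set(layer_of)
    if layer is Layer.PLATFORM:
        # Platform layer fault: everything except the system layer.
        return {g for g, l in layer_of.items() if l is not Layer.SYSTEM}
    # Application layer fault: only the faulty group itself.
    return {faulty_group}


if __name__ == "__main__":
    # Hypothetical process groups on the first computer.
    layer_of = {
        "os_services": Layer.SYSTEM,
        "database": Layer.PLATFORM,
        "call_router": Layer.APPLICATION,
    }
    for group in layer_of:
        print(group, sorted(out_of_service_scope(group, layer_of)))
```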

56. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group; 

d. means for taking all of said process groups out of
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least a first of said at least 
two computers, a platform layer having at least one 
process group; 

g. means for taking all of said process groups, except each 
of said at least one process group in said system layer, 
out of service on said first computer upon a fault in a 
resource depended upon by at least one of said platform 
layer process groups on said first computer; 

h. means for providing at least one paired process group 
on a second of said at least two computers, said at least 
one paired process group being paired with one of said 
process groups on said first computer; and 

i. means for activating said at least one paired process 
group on said second computer upon said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

57. The apparatus of claim 56 further comprising, in 
combination: 

a. means for re-initializing said resource having said fault; 
and 

b. means for re-initializing all of said process groups, 
except each of said at least one process group in said 
system layer, on said first computer upon said fault in 
said resource depended upon by at least one of said 
platform layer process groups on said first computer. 

58. The apparatus of claim 57 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initializations to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

59. The apparatus of claim 58 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said fault in said
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

60. The apparatus of claim 59 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

61. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said
process groups;

b. means for running, on at least one of said at least two
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least a first of said at least 
two computers, a platform layer having at least one 
process group; 

g. means for restarting at least one of said platform layer 
process groups upon a platform layer process group 
fault occurring on said first computer; 

h. means for taking all of said process groups, except each 
of said at least one process group in said system layer, 
out of service on said first computer upon failure of said 
re-start to cure said platform layer process group fault; 

i. means for providing at least one paired process group on
a second of said at least two computers, said at least one
paired process group being paired with one of said
process groups on said first computer; and

j. means for activating said at least one paired process
group on said second computer upon failure of said
re-start to cure said platform layer process group fault.

62. The apparatus of claim 61 further comprising, in 
combination: means for re-initializing all of said process 
groups, except each of said at least one process group in said 
system layer, on said first computer upon failure of said 
re-start to cure said platform layer process group fault. 

63. The apparatus of claim 62 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said platform layer 
process group fault. 

64. The apparatus of claim 63 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said platform layer 
process group fault. 

65. The apparatus of claim 64 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said platform layer 
process group fault. 

66. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least said first computer, a 
system layer having at least one process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least a first of said at least 
two computers, an application layer having at least one 
process group; 

g. means for taking said at least one application layer 
process group out of service on said first computer 
upon a fault in said at least one application layer 
process group on said first computer; 

h. means for providing at least one paired process group
on a second of said at least two computers, said paired 
process group being paired with one of said at least one 
application layer process group taken out of service on 
said first computer; 

i. means for activating said paired process group on said 
second computer upon said fault in said at least one 
application layer process group on said first computer; 
and 

j. means for re-initializing all of said process groups, 
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said application layer process group having said 
fault out of service on said first computer. 

67. The apparatus of claim 66 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said application layer
process group fault. 

68. The apparatus of claim 67 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said application layer process 
group fault. 

69. The apparatus of claim 68 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said application layer 
process group fault. 
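
Claim 66 pairs an application layer process group on the first computer with a process group on a second computer and activates the pair when the faulty group is taken out of service. The sketch below models that hand-over in Python; the class, the state names ("active", "standby", "out_of_service"), and the host names are assumptions introduced for the example and do not come from the claims.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity comparison; paired groups reference each other
class ProcessGroup:
    name: str
    host: str
    state: str = "standby"  # "active", "standby", or "out_of_service"
    pair: Optional["ProcessGroup"] = None


def fail_over(group: ProcessGroup) -> Optional[ProcessGroup]:
    """Take `group` out of service and activate its pair, if one is on standby.

    Returns the newly activated pair, or None when no pair could take over.
    """
    group.state = "out_of_service"
    if group.pair is not None and group.pair.state == "standby":
        group.pair.state = "active"
        return group.pair
    return None


if __name__ == "__main__":
    primary = ProcessGroup("call_router", host="computer1", state="active")
    backup = ProcessGroup("call_router", host="computer2")
    primary.pair, backup.pair = backup, primary
    taken_over_by = fail_over(primary)
    print(primary.state, backup.state, taken_over_by.host)
```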

70. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, an application layer having at least two 
process groups; 

d. means for defining a dependency by at least a first of 
said at least two application layer process groups upon 
at least a second of said at least two application layer 
process groups; 

e. means for taking said first and said second application 
layer process groups out of service on said first com- 
puter upon a fault in said second application layer 
process group on said first computer; 

f. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with said second applica- 
tion layer process group on said first computer; and 

g. means for activating said paired process group on said 
second computer upon said fault in said second appli- 
cation layer process group on said first computer. 

71. The apparatus of claim 70 further comprising, in 
combination: 

a. means for providing, on at least said first computer, a 
system layer having at least one process group; 

b. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

c. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; and 



d. means for re-initializing all of said process groups,
except each of said at least one process group in said
system layer, on said first computer upon not being able
to take said first application layer process group out of
service on said first computer.

72. The apparatus of claim 71 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said fault in said 
second application layer process group. 

73. The apparatus of claim 72 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said second 
application layer process group. 

74. The apparatus of claim 73 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
second application layer process group. 

75. An apparatus for providing high availability applica-
tions comprising, in combination:

a. means for running, on at least two computers, one or
more process groups, at least one of said process groups
containing one or more processes that have a fault
recovery strategy common to said at least one of said
process groups;

b. means for running, on at least one of said at least two
computers, a process group manager that initiates a
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, an application layer having at least one 
process group; 

d. means for taking said at least one application layer 
process group out of service on said first computer 
upon a fault in a resource depended upon by said at 
least one application layer process group; 

e. means for providing at least one paired process group 
on a second of said at least two computers, said paired
process group being paired with one of said at least one
application layer process group on said first computer;
and

f. means for activating said paired process group on said
second computer upon said fault in said resource
depended upon by said at least one application layer 
process group. 

76. The apparatus of claim 75 further comprising, in 
combination: 

a. means for providing, on at least said first computer, a 
system layer having at least one process group; 

b. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

c. means for re-booting said first computer upon a system
layer process group fault occurring on said first com- 
puter; 

d. means for re-initializing said resource having said fault; 
and 

e. means for re-initializing all of said process groups, 
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said at least one application layer process group 
out of service on said first computer. 

77. The apparatus of claim 76 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initializations to cure said fault in said
resource depended upon by said at least one application 
layer process group. 

78. The apparatus of claim 77 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said resource 
depended upon by said at least one application layer process 
group. 

79. The apparatus of claim 78 further comprising, in
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by said at least one application 
layer process group. 

80. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least said first computer, a 
system layer having at least one process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least a first of said at least 
two computers, an application layer having at least one 
process group; 

g. means for re-starting said at least one application layer 
process group on said first computer upon a fault in said 
at least one application layer process group; 

h. means for taking said at least one application layer 
process group out of service on said first computer 
upon failure of said re-start to cure said fault in said at 
least one application layer process group; 

i. means for providing at least one paired process group on 
a second of said at least two computers, said paired 
process group being paired with one of said at least one 
application layer process group taken out of service on 
said first computer; 

j. means for activating said paired process group on said 
second computer upon failure of said re-start to cure 
said fault in said at least one application layer process 
group taken out of service on said first computer; and 

k. means for re-initializing all of said process groups, 
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said application layer process group having said 
fault out of service on said first computer. 

81. The apparatus of claim 80 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said application layer 
process group fault. 

82. The apparatus of claim 81 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said application layer process 
group fault. 

83. The apparatus of claim 82 further comprising, in
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said application layer 
process group fault. 

84. A method for providing high availability applications 
comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, an application layer having at least two 
process groups; 

d. means for defining a dependency by at least a first of 
said at least two application layer process groups upon 
at least a second of said at least two application layer 
process groups; 

e. means for re-starting said second application layer 
process group on said first computer upon a fault in said
second application layer process group;

f. means for taking said first and said second application
layer process groups out of service on said first com-
puter upon failure of said re-start to cure said fault in
said second application layer process group; 

g. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with said first application 
layer process group on said first computer; and 

h. means for activating said at least one paired process
group on said second computer upon failure of said
re-start to cure said fault in said second application
layer process group. 

85. The apparatus of claim 84 further comprising, in 
combination: 

a. means for providing, on at least said first computer, a 
system layer having at least one process group; 

b. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

c. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; and 

d. means for re-initializing all of said process groups, 
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said first application layer process group out of 
service on said first computer. 

86. The apparatus of claim 85 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said fault in said
second application layer process group. 

87. The apparatus of claim 86 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said second 
application layer process group. 

88. The apparatus of claim 87 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
second application layer process group. 

METHOD AND APPARATUS FOR
PROVIDING SCALEABLE LEVELS OF
APPLICATION AVAILABILITY

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to computer system
architectures, and more particularly to a reliable cluster
computing architecture that provides scalable levels of high
availability applications simultaneously across commercially
available computing elements.

2. Statement of Related Art

Prior art high availability clustered computer systems are
typically configured in an architecture having shared physical
storage devices, such as a shared disk. Therefore, prior art
cluster offerings are typically based on physical hardware, or
clustered arrangements of systems and storage, particularly
adapted to a unique application processing environment. In a
common type of prior art high availability cluster, all of the
critical application data must reside on an external shared
disk, or on a pool of disks, that is accessible from at most
one computing system in the cluster. Such a prior art cluster
tries to isolate access to the data partitions on the disk so
that access to the shared disk is limited to only one computing
system at a time. Upon failure of the primary computing system,
a takeover occurs whereby the high availability cluster
reallocates access to the disk from the primary computing
system to the dedicated backup system. Once such a reallocation
is performed, the applications on that backup system will have
access to the disk.

Another prior art high availability cluster solution is a
multi-processor cluster. Like the shared-disk cluster, the
multi-processor cluster is a hardware-based cluster arrangement
of computing systems. Unlike the shared-disk cluster, in which
the computing systems are essentially unrelated to each other,
the computing systems in a multi-processor cluster are all
running the same application and using the same data at
virtually the same time. All physical storage is configured to
be accessible to all computing systems. Such multi-processor
clusters, in an attempt to control access to concurrent data,
typically use lock management software to manage access to data
and prevent any data corruption or integrity problems. The loss
of a computing system from a multi-processor cluster allows the
remaining systems to continue processing the data.

Another prior art high availability cluster solution is a
symmetrical multi-processing, or scalable parallel processing,
cluster based on a shared memory or system bus architecture
where the memory is common to multiple computing systems. Such
systems, in an attempt to improve performance by scaling the
number of computing systems in the symmetrical multi-processing
cluster, allow a single computing system failure to cause the
entire symmetrical multi-processing or scalable parallel
processing cluster platform to become unavailable.

Yet another high availability cluster architecture is a
multiple parallel processor cluster, in which each computing
system has its own memory and disk, none of which are shared
with any other computing system in the cluster. If one system
has data on a disk, and that data is required by another
computing system, the first computer sends the data over a high
speed network to the other computing system. Such multiple
parallel processor clusters, in an attempt to improve
performance by allowing multiple computing systems to work
concurrently, allow data associated with a failed computing
system to become unavailable.

The prior art high availability clusters, in trying to provide
different levels of availability, have used operating
system-based clusters to optimize the unique data and
application characteristics for a specific targeted commercial
market. Such a targeted approach does not lend itself well to
certain industries, including telecommunications, in which
numerous legacy applications currently exist, each with unique
recovery and performance characteristics running on proprietary
hardware, [illegible] of which is fault tolerant.

Therefore, a computing system architecture that provides
varying levels of high availability applications simultaneously
across one or more loosely coupled commercially available
computing elements using a commercially available interconnect
is desirable.

Prior art high availability cluster solutions have the
capability to support "heartbeats" and recovery of a specified
application. The most significant architectural difference
between the prior art solutions is the method for determining
how an application and/or computing system is chosen or
controlled to be active or standby and the method for
determining when they will be allowed access to the application
data. Typical physical high availability cluster solutions
determine the status of the configuration via a set of
redundant communication facilities between the pair of
computing systems. Under most circumstances, the paired systems
are able to determine which system is active for an
application.

In prior art high availability solutions, when all
communication is lost between computing systems, the computing
systems or clustered applications might each take on an active
role believing that the other has failed. Such a situation
presents an undesirably high risk of application data and
processing being corrupted. Several added levels of protection
and safety are possible to prevent that from happening. Some
solutions in the prior art nearly eliminate this risk using
heartbeats through the shared storage. Since certain cluster
solutions do not need to use shared storage, a platform neutral
hardware component is desirable to complement the
software-based cluster components. It is therefore an object of
this invention to provide scaleable layers of available
application processes using loosely coupled commercially
available computing elements.

SUMMARY OF THE INVENTION

This invention provides a method and an apparatus for providing
scaleable layers of high availability applications using
loosely coupled commercially available computing elements, also
referred to as computers. Computing elements refers to any type
of processor or any device containing such a processor.

Resource dependencies and fault recovery strategies occur at
the process group level. For example, a process group
containing three processes might depend upon four resources,
such as other process groups or peripheral devices, such as a
disk. Upon failure of a single process within the process group
or upon failure of a single resource depended upon by the
process group, fault recovery will be initiated for the entire
process group, as a single unit.

Process groups can belong to one of three layers: the system
layer, the platform layer, or the application layer. In the
preferred embodiment, each layer has a unique process group
activation and fault recovery strategy. In the preferred
embodiment, an application layer process group may be paired
with another application layer process group on a separate
computer. As part of certain escalated process group
fault recovery strategies, upon taking an application layer
process group out of service, its paired application layer 
process group, if any exists, takes over performing the 
functions of the process group that was taken out of service. 

Application layer process groups depend upon one or
more platform layer process groups, which depend upon one 
or more system layer process groups, which depend upon the 
hardware of the loosely coupled computer hosting the pro- 
cess groups. 

Upon a system layer process group failure, all process
groups on the host computer are taken out of service, which 
includes activating on another computer or computers any 
application layer process group that is paired with an appli- 
cation layer process group taken out of service, the computer 
hosting the failed system layer process group is re-booted,
and all system layer, platform layer, and application layer 
process groups are re-initialized. 

Upon a platform layer process group failure, the platform 
layer process group may be re-started zero or more times. If
re-starting the failed platform layer process group does not 
cure the platform layer process group failure or if the 
platform layer process group is not restartable, all applica- 
tion layer and platform layer process groups on the host 
computer are taken out of service and re-initialized, which
includes activating on another computer or computers any 
application layer process group that is paired with an appli- 
cation layer process group taken out of service on the host 
computer. 

Upon failure of a resource depended upon by a platform
layer process group, all application layer and platform layer 
process groups on the host computer are taken out of service 
and re-initialized, which includes activating on another 
computer or computers any application layer process group 
that is paired with an application layer process group taken
out of service on the host computer. 

Upon failure of an application layer process group, the 
failed application layer process group may be restarted zero 
or more times. If restarting the failed application layer 
process group does not correct the application layer process
group failure or if the failed application layer process group 
is not restartable, then the failed application layer process 
group is taken out of service, which includes activating on 
another computer the application layer process group, if any, 
that is paired with the application layer process group taken
out of service. 

Upon failure of a resource depended upon by an appli- 
cation layer process group, the dependent application layer 
process group is taken out of service, which includes
activating on another computer the application layer process
group, if any, that is paired with the dependent application 
layer process group taken out of service on the host com- 
puter. 
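
The layer-by-layer rules summarized above can be sketched as a small lookup. This is only an illustration under stated assumptions: the names `Layer` and `summary_recovery` and the wording of the returned steps are hypothetical and do not come from the specification, and a fault in a depended-upon resource is modeled simply as the case where no re-start cures the fault.

```python
from enum import Enum
from typing import List


class Layer(Enum):
    SYSTEM = "system"
    PLATFORM = "platform"
    APPLICATION = "application"


def summary_recovery(layer: Layer, restart_cured: bool = False) -> List[str]:
    """Return the recovery steps the summary describes for a failed process group.

    `restart_cured` says whether re-starting the failed group (zero or more
    times) fixed the fault. It is ignored for the system layer, where the
    summary goes straight to taking every group on the host out of service.
    """
    activate_pairs = "activate paired application layer groups on other computers"
    if layer is Layer.SYSTEM:
        return [
            "take all process groups on the host out of service",
            activate_pairs,
            "re-boot the host",
            "re-initialize all process groups",
        ]
    if restart_cured:
        return []  # the re-start was enough, so nothing is escalated
    if layer is Layer.PLATFORM:
        return [
            "take all platform and application layer groups on the host out of service",
            activate_pairs,
            "re-initialize the platform and application layer groups",
        ]
    return [
        "take the failed application layer group out of service",
        "activate its paired application layer group, if any",
    ]
```

For example, `summary_recovery(Layer.PLATFORM)` lists a host-wide platform and application layer recovery, while `summary_recovery(Layer.APPLICATION)` touches only the failed group and its pair.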

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustration of the subject 
invention, including four loosely coupled computers, an 
independent computer, and a maintenance terminal. 

FIG. 2 is a state diagram illustrating the sequence of states
that a process group in the subject invention can transition 
through. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

FIG. 1 shows the preferred embodiment of the subject
invention, including a maintenance terminal (MT) 2, an
independent computer (IC) 4, and four industry standard
commercially available computing elements 6 (also referred to
as computers 6) loosely coupled together through an
interconnect 8, such as a network, a computer bus architecture,
and the like. Computing elements 6 could be any type of
processor or any device containing such a processor.

Each computer 6 is running process group management 
software 10, also referred to collectively as the process 
group manager. The process group manager 10 activates 
process groups and initiates fault recovery strategies at the 
process group level. A process group is a group of processes, 
which are typically implemented in software, that are related 
to each other in some way such that it is desirable to manage 
the process group as a single unit. It may be desirable to 
restart all of the processes within a process group together, 
or it may be desirable, as part of an escalated fault recovery 
strategy, to have the functionality of all of the processes in 
the process group performed by a process group on a 
separate computer. The process group might, but does not 
have to, depend upon a resource, or a set of resources 
common to the process group. 
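
As a rough illustration of managing a process group as a single unit, the sketch below records a group's processes and the resources it depends upon, and reports which groups need recovery when any one of them fails. The class name `ProcessGroup`, the helper `groups_needing_recovery`, and the example group, process, and resource names are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class ProcessGroup:
    name: str
    processes: Set[str]
    # Resources the whole group depends upon; a group may have none.
    resources: Set[str] = field(default_factory=set)

    def is_affected_by(self, failed: str) -> bool:
        """True if `failed` names one of this group's processes or resources."""
        return failed in self.processes or failed in self.resources


def groups_needing_recovery(groups: Iterable[ProcessGroup], failed: str) -> List[str]:
    """Recovery is per group: one failed member or resource marks the whole group."""
    return [g.name for g in groups if g.is_affected_by(failed)]


# Two made-up groups that share one resource ("disk0").
billing = ProcessGroup("billing", {"rater", "mediator", "writer"}, {"disk0", "link0"})
routing = ProcessGroup("routing", {"router"}, {"disk0"})
```

Here a fault in the single process `"rater"` selects the whole `billing` group, and a fault in the shared resource `"disk0"` selects both groups.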

The independent computer 4 is preferably a computing 
device designed to have a minimal number of faults over an 
extended period of time. The independent computer 4 moni- 
tors computers 6 for hardware faults using heartbeats, as 
disclosed in commonly assigned U.S. Pat. No. 5,560,033. 
Each of the loosely coupled computers 6 is coupled to the 
independent computer 4 as shown with reference numeral 12
in FIG. 1. 

Process groups may belong to one of three layers: the 
system layer 13, the platform layer 15, and the application 
layer 17. Each computer 6 is shown running process group 
management (PGM) software 10, one system layer process 
group (SLPG) 14, one platform layer process group (PLPG) 
16, either one or two application layer process groups 
(ALPG) 18 that are not part of a process-group pair (PGP) 24,
one primary application layer process group (P-ALPG) 20 
that is part of a process-group pair 24, and one alternate 
application layer process group (A-ALPG) 22 that is part of 
a process-group pair 24. By definition: (1) each process- 
group pair 24 contains one primary process group 20 on a 
first computer 6 and one alternate process group 22 on a 
second computer 6; (2) each primary process group 20, and 
each alternate process group 22, is part of a process-group 
pair; and (3) process groups 14, 16, and 18 are not part of a 
process-group pair 24. In the preferred embodiment, 
process-group pairs may belong only to the application layer 
17; however, process-group pairs could be provided at the
system layer 13, the platform layer 15, or the application
layer 17, or at all of these layers, without departing from the scope
of this invention. 
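
The three-part definition of a process-group pair above can be read as a pair of constraints checked when a pair is formed. The following is a minimal sketch under that reading. `PairedGroup`, `make_pair`, and the integer numbering of computers are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PairedGroup:
    name: str
    computer: int      # hypothetical index of the hosting computer 6
    is_primary: bool   # True for a primary group 20, False for an alternate 22


def make_pair(primary: PairedGroup, alternate: PairedGroup) -> Tuple[PairedGroup, PairedGroup]:
    """Build a process-group pair, enforcing one primary and one alternate
    hosted on two different computers."""
    if not primary.is_primary or alternate.is_primary:
        raise ValueError("a pair needs exactly one primary and one alternate")
    if primary.computer == alternate.computer:
        raise ValueError("primary and alternate must be on separate computers")
    return (primary, alternate)
```

Rejecting a pair whose two members share a computer reflects the point of pairing: the alternate can only take over if it does not fail together with the primary's host.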

Although each computer 6 in the preferred embodiment is 
running the number of certain types of process groups 
mentioned above, without departing from the scope of this
invention, each computer 6 could be running: (1) zero or 
more process groups 14, 16, and 18; (2) zero or more 
primary process groups 20; and (3) zero or more alternate 
process groups 22. 

Similarly, the number of computers 6 can be two or more 
without departing from the scope of this invention, and, 
although the process group management software 10 is 
shown running on all four of the loosely coupled computers 
6, it could be running on any permutation or combination of 
the loosely coupled computers 6 and/or the independent 
computer 4 without departing from the scope of this invention.
Further, the functions performed by the independent computer 4
could be performed using any type of device capable of running
any type of processor, such as a peripheral board with a
digital signal processor, a system board with a general purpose
processor, a fault tolerant processor and the like, without
departing from the scope of this invention.

The independent computer 4 uses industry-standard interfaces
12, such as RS-232 in the preferred embodiment, although any
interface could be used. This provides the capability of
leveraging general networking interfaces, such as Ethernet,
Fiber Distributed Data Interface (FDDI), Asynchronous Transfer
Mode (ATM), and protocols, such as Transmission Control
Protocol/Internet Protocol (TCP/IP), or Peripheral Component
Interconnect (PCI).

Each loosely coupled computer 6 typically contains a
uni-processor, multi-processor, or fault tolerant processor
system having an operating environment that is the same as, or
different from, the operating environment of each of the other
loosely coupled computers 6. In other words, separate computers
6 can run different operating systems, for instance, WINDOWS as
opposed to UNIX, different operating environments, for
instance, real-time as opposed to non-real-time, and have
different numbers and types of processors. Each of the loosely
coupled computers 6 can be either located at the same site or
geographically separated and connected via a network 8, such as
a local area network (LAN) or a wide area network (WAN).

As previously mentioned, each process group 14, 16, 18, 20, or
22 contains one or more processes that may depend on a set of
resources common to the process group 14, 16, 18, 20, or 22.
For example, a set of such resources could include a computer
hardware peripheral device, or another process group 14, 16,
18, 20, or 22, a communication link, available disk space, or
anything that might affect the availability of an external
application. Each alternate process group 22 depends upon a set
of resources that is functionally equivalent to the set of
resources depended upon by the alternate process group's paired
primary process group 20. The set of resources depended upon by
a primary process group 20 and the separate set of resources
depended upon by its paired alternate process group 22 do not,
however, have to contain the same number of resources. Each of
the processes contained within a process group 14, 16, 18, 20,
or 22 also has an activation and fault recovery strategy common
to the process group 14, 16, 18, 20, or 22.

FIG. 2 shows the states through which every process group 14,
16, 18, 20, and 22 may transition, namely, Unavailable
(Unavail) 30, Initialization (Init) 32, Standby 34, Active 36,
and Off-line 38. Process groups in the Unavailable 30 and
Initialization 32 states have not been started and, therefore,
are not running. Process groups in the Active state 36 have
been started and are running. Whether a process group in the
Standby state 34 has been started and is running depends upon
whether the process group is a hot-standby or a cold-standby
process group. Hot-standby process groups in the Standby 34
state have been started and are waiting to be activated.
Cold-standby process groups in the Standby 34 state are not
started until they are activated. Activation of cold-standby
process groups may also involve initializing any uninitialized
resources depended upon by the cold-standby process group. In a
non-fault condition, primary process groups 20 are initialized
to run in the Active state 36, and alternate process groups 22
are initialized to the Standby state 34.

In the preferred embodiment, the Off-line state 38 can only be
entered and exited by manual operation. In other words, a human
operator must enter a command from the local maintenance
terminal 2 to put one or more process groups 14, 16, 18, 20,
and 22 into the Off-line state 38 or to remove one or more
process groups 14, 16, 18, 20, and 22 from the Off-line state
38. The Off-line state 38 may be entered under circumstances
other than by manual operation, such as upon a command from the
process group manager 10, without departing from the scope of
this invention.

Assuming there are no resource or process group faults and
neither the primary nor the alternate process groups 20 and 22
have been manually transitioned to the Off-line state 38,
process-group pairs 24 contain a primary process group 20 and
an alternate process group 22 in an Active 36/Standby 34 paired
relationship: active/cold-standby or active/hot-standby. In the
preferred embodiment, the primary process group 20 is
initialized to the Active state 36 and the alternate process
group 22 is initialized to the Standby state 34.

Although primary process groups 20 are initialized to Active 36
and alternate process groups 22 are initialized to Standby 34,
under certain conditions, an alternate process group 22 can be
Active 36 while its paired primary process group 20 is Standby
34. For example, if a fault occurs in a primary process group
20 contained in a process-group pair 24 having an Active
36/Standby 34 paired relationship, the process group manager 10
will transition that primary process group 20 from Active 36 to
Unavailable 30 and transition the alternate process group 22
from Standby 34 to Active 36. Once the fault that caused the
primary process group 20 to be transitioned to Unavailable 30
has been corrected, the primary process group 20 will be
transitioned to Standby 34 until some event, such as an
alternate process group fault, occurs to cause the alternate
process group 22 to transition to Unavailable 30, at which time
the primary process group 20 will be transitioned from Standby
34 to Active 36. A process-group pair's primary and alternate
process groups 20 and 22 can be switched from Active 36/Standby
34 to Standby 34/Active 36 manually or under circumstances in
which switching the Active 36 process group 20 or 22 to Standby
34 and the Standby 34 process group 20 or 22 to Active 36 is
desirable.

The availability of an application layer process group 18, 20
or 22 typically depends upon the availability of one or more
platform layer process groups 16. The availability of a
platform layer process group 16 typically depends upon the
availability of one or more system layer process groups 14, and
the availability of system layer process groups 14 depends upon
the availability of the hardware of the computer 6 hosting the
system layer process groups 14. System layer process groups 14
are initialized before platform layer process groups 16, which
are initialized before application layer process groups 18, 20,
and 22.

This invention provides the flexibility to implement external
applications using various numbers of process groups 14, 16,
18, 20, and 22, and/or process-group pairs 24 spread across two
or more computers 6. For instance, the four process-group pairs
24 shown in FIG. 1 could be part of one external application,
or they could be part of four, or three, or two, separate
external applications. In addition, two of the process-group
pairs 24 shown in FIG. 1 could be part of the same external
application, yet be hosted by two separate pairs of computers
6. Further, one or more process-group pairs 24 and/or one or
more process groups 14, 16, 18, 20, or 22 may depend upon the
same resource as one or more other process-group pairs 24
and/or one or more process groups 14, 16, 18, 20, or 22, such
that the failure of a single
resource could cause a plurality of process groups' and/or a
plurality of process-group pairs' fault recovery strategies to
be performed.

Each layer, system 13, platform 15, and application 17, has a
unique process group activation and fault recovery strategy. In
the preferred embodiment, the system layer contains
non-restartable non-relocatable process groups 14. The platform
layer may contain either, or both, of two types of process
groups. Platform layer process groups 16 may be either: (1)
non-restartable and non-relocatable; or (2) restartable and
non-relocatable. The application layer 17 may contain any, or
all, of the following three types of process groups: (1)
non-restartable and non-relocatable process groups 18; (2)
restartable and non-relocatable process groups 18; and (3)
restartable and relocatable process groups 20 and 22.
Relocatable refers to relocating performance of the
functionality of a primary process group 20 or an alternate
process group 22 from one computer 6 to another computer 6,
rather than any type of re-location within the same computer 6.
Primary process groups 20 and alternate process groups 22 are
relocatable. Process groups 14, 16, and 18 are not relocatable.
As previously mentioned, in the preferred embodiment, a primary
process group 20 and an alternate process group 22 can belong
to only the application layer 17. Nevertheless, it will be
obvious to those having ordinary skill in the art that primary
and alternate process groups 20 and 22 could belong to the
platform and/or system layers and that other permutations and
combinations of layer-based fault recovery and process group
activation strategies and restartability and relocatability can
be implemented without departing from the scope of this
invention.

Upon a system layer process group failure: (1) all system
layer, platform layer, and application layer process groups 14,
16, 18, 20, and 22 on the computer 6 hosting the failed system
layer process group ("host computer") are taken out of service
by transitioning them to the Unavailable state 30; (2) for each
primary and alternate process group 20 and 22 that is
transitioned from Active 36 to Unavailable 30 on the host
computer, its paired process group 20 or 22 is activated on a
separate computer 6 by transitioning the paired process group
20 or 22 from Standby 34 to Active 36; and (3) the computer 6
that is hosting the failed system layer process group 14 is
re-booted. In the preferred embodiment, re-booting can be done
a pre-determined number of times over a pre-determined time
period. If re-booting host computer 6 does not clear the fault,
the independent computer 4 will then power cycle the host
computer 6, thereby re-booting the host computer 6. In the
preferred embodiment, power cycling can be done a
pre-determined number of times over a pre-determined time
period. If power cycling the host computer does not clear the
system layer process group failure, the independent computer 4
will cut off power to the host computer 6, which will remain in
a powered down state.

Upon a platform layer process failure, the failed process may
be re-started zero or more times. In the preferred embodiment,
re-startable platform layer processes may be re-started a
pre-determined number of times over a pre-determined time
period, for instance three re-starts within five minutes.

Upon failure of re-starting the failed platform layer process
to cure the failure, the process group 16 containing the failed
process may be re-started zero or more times. In the preferred
embodiment, re-startable platform layer process groups 16 may
be re-started a pre-determined number of times over a
pre-determined time period. When such a process group 16 is
restarted, it is restarted in its previous running state.

If re-starting the failed platform layer process group 16 does
not correct the platform layer process failure, fault recovery
is escalated to: (1) taking all application layer process
groups out of service, including activating any process groups
20 or 22 on separate computers that are running Standby 34 and
that are paired with an application layer process group 20 or
22 taken out of service, if any such paired Standby process
groups 20 or 22 exist; and (2) re-initializing all platform
layer process groups 16 on the computer hosting the failed
platform layer process. If the failed platform layer process
and the process group in which it is contained are both
non-restartable, then this escalated fault recovery strategy is
the initial fault recovery action. This escalated platform
layer process group fault recovery procedure is also
implemented upon detection of a fault in a resource depended
upon by a platform layer process group 16, with the added step
of re-initializing the resource for which a fault was detected.
If re-initializing all platform layer process groups 16 on the
computer 6 hosting the failed platform layer process or
resource ("hosting computer") does not cure the failure,
hosting computer 6 is re-booted, thereby causing all process
groups 14, 16, 18, 20, and 22 on hosting computer 6 to be
re-started. In the preferred embodiment, hosting computer 6 can
be re-booted a pre-determined number of times over a
pre-determined time period. If re-booting hosting computer 6
does not cure the failure, independent computer 4 will then
power cycle hosting computer 6, thereby re-booting hosting
computer 6. In the preferred embodiment, hosting computer 6 can
be power cycled a pre-determined number of times over a
pre-determined time period. If power cycling hosting computer 6
does not clear the platform layer process group failure,
independent computer 4 will cut off power to hosting computer
6, which will remain in a powered down state.

Upon failure of a process within an application layer process
group 18, 20 or 22, the failed process may be re-started zero
or more times. In the preferred embodiment, re-startable
application layer processes may be re-started a pre-determined
number of times over a pre-determined time period.

Upon failure of re-starting the failed application layer
process to cure the failure, the process group 18, 20 or 22
containing the failed process may be re-started zero or more
times. In the preferred embodiment, re-startable application
layer process groups 18, 20, and 22 may be re-started a
pre-determined number of times over a pre-determined time
period. When such a process group 18, 20 or 22 is restarted, it
is restarted in its previous running state.

If the failed application layer process group 18, 20, or 22 is
an application layer process group 18, or a primary or
alternate process group 20 or 22 running Active 36, and
re-starting such a failed process group does not correct the
application layer process failure, fault recovery is escalated
to: taking such a failed application layer process group out of
service by transitioning it to the Unavailable state 30, and,
for such a primary or alternate process group 20 or 22,
activating its paired standby process group 20 or 22 on a
separate computer 6. This escalated fault recovery strategy is
the initial fault recovery strategy for process faults of
non-restartable processes contained within non-restartable
application layer process groups 18, 20 and 22. This escalated
fault recovery strategy is also used upon detection of a fault
in a resource depended upon by an application layer process
group 18, 20, or 22 running in the Active state 36, with the
added step of re-initializing the resource for which a fault
was detected.
If an application layer process group 18, 20, or 22 cannot
be taken out of service and transitioned to Unavailable 30, 
fault recovery escalates to re -initializing and re -starting all 
application layer and platform layer process groups 16, 18, 
20, and 22 on the computer 6 hosting the failed application 
layer process group or resource ("computer hosting the 
application layer failure"). In the preferred embodiment, 
such re-initializations and re-starts can be performed a 
pre-determined number of times over a pre-determined 
period on the computer hosting the application layer failure 
6. If re-initializing and re-starting all platform layer and
application layer process groups 16, 18, 20, and 22 on the 
computer hosting the application layer failure does not cure 
the failure, the computer hosting the application layer failure 
is re-booted, thereby causing all process groups 14, 16, 18, 
20, and 22 on the computer hosting the application layer 
failure to be re-started. In the preferred embodiment, the
computer hosting the application layer failure can be 
re-booted a pre-determined number of times over a pre- 
determined time period. If re-booting the computer hosting 
the application layer failure does not cure the failure, inde- 
pendent computer 4 will then power cycle the computer 
hosting the application layer failure, thereby re-booting the 
computer hosting the application layer failure. In the pre- 
ferred embodiment, the computer hosting the application 
layer failure can be power-cycled a pre-determined number 
of times over a pre-determined time period. If power cycling 
the computer hosting the application layer failure does not 
clear the failure, independent computer 4 will cut off power 
to the computer hosting the application layer failure, which 
will remain in a powered down state. 
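
The escalation sequence above can be summarized as an ordered list of recovery steps, each tried up to a limit before moving on to the next. The Python sketch below is a simplified illustration under stated assumptions: the function name recover, the step names, and the per-step attempt counts are hypothetical, and the pre-determined time period attached to each limit is omitted for brevity.

```python
from typing import Callable, List, Tuple

# (step name, action returning True when the failure is cured, attempt limit)
Step = Tuple[str, Callable[[], bool], int]


def recover(steps: List[Step]) -> List[str]:
    """Try each step up to its limit, in order; return the actions taken.

    If no step cures the failure, the final action is "power-down",
    after which the computer stays powered down.
    """
    taken: List[str] = []
    for name, action, limit in steps:
        for _ in range(limit):
            taken.append(name)
            if action():
                return taken  # failure cured, no further escalation
    taken.append("power-down")
    return taken
```

A caller would pass the steps in the order given above: re-initializing and re-starting the platform layer and application layer process groups, re-booting, and then power cycling from the independent computer 4.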

Application layer process groups 18, 20, and 22 may, or 
may not, depend upon application layer process groups 18. 
Therefore, a fault in an application layer process group 18, 
20 or 22 will not affect the availability of any system layer 
process groups 14 or any platform layer process groups 16. 
A fault in an application layer process group 18, 20 or 22 
will also not affect the availability of any application layer 
process groups 18, 20 and 22 that are not dependent upon the 
failed application layer process group 18, 20, or 22. 
However, any application layer process groups 18, 20, and 
22 that are dependent upon an application layer process 
group 18 that is taken out of service will also be taken out
of service.
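
The dependency rule above can be illustrated with a short Python sketch. It is an illustration only: the function name groups_to_take_out, the dictionary representation of dependencies, and the group names used in the usage note are hypothetical. The sketch also assumes that the rule applies transitively, that is, a group that depends on a dependent group is taken out of service as well.

```python
from typing import Dict, Set


def groups_to_take_out(failed: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the failed group plus every group that depends on it,
    directly or through other dependent groups.

    `depends_on` maps each group name to the set of groups it depends on.
    """
    out = {failed}
    changed = True
    while changed:
        changed = False
        for group, deps in depends_on.items():
            # A group goes out of service if anything it depends on is out.
            if group not in out and deps & out:
                out.add(group)
                changed = True
    return out
```

For example, with depends_on = {"billing": {"db"}, "paging": set()}, a fault in "db" takes "db" and "billing" out of service and leaves "paging" untouched.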

If a failed primary or alternate process group 20 or 22 is 
running Standby 34 and re-starting such a failed process 
group does not correct the failure, fault recovery is escalated 
to: taking such a failed Standby 34 primary or alternate 
process group 20 or 22 out of service by transitioning it to 
the Unavailable state 30. Upon inability to take such a failed 
Standby 34 primary or alternate process group 20 or 22 out 
of service, the previously described escalated fault recovery 
strategy that is implemented upon inability to take any 
application layer process group 18, 20, or 22 out of service 
is implemented. 

Upon detection of a fault in a resource depended upon by 
an application layer process group 20 or 22 running in the
Standby state 34 or upon detection of a process failure in a 
non-restartable process group 20 or 22 running in the 
Standby state 34, the application layer process group 20 or 
22 that is dependent upon the failed resource, or the process 
group 20 or 22 for which the failure was detected, is 
transitioned to the Unavailable state 30. Upon inability to 
take such a process group 20 or 22 out of service, the 
previously described escalated fault recovery strategy that is 
implemented upon inability to take any application layer 
process group 18, 20, or 22 out of service is implemented. 
The process group manager 16 can make the state of
individual process groups 14, 16, 18, 20, and 22 and critical 
resources known to either: (1) only those computer systems 
6 hosting the process groups 14, 16, 18, 20, and 22 to which 
the state information pertains, or (2) all of the loosely
coupled computing systems 6. In addition, such state infor- 
mation can be made available to external application soft- 
ware. 

In the preferred embodiment, transitions between process
group states are controlled by process group management
software 10 running on each of the loosely coupled
computers 6. Nevertheless, such transitions could be controlled
by process group management software 10 running on any
permutation and/or combination of the centralized computer
4 and the loosely coupled computers 6 without departing
from the scope of this invention.

System layer process groups 14 can contain either: (1) 
operating system software and services found in normal 
commercially available computing systems; or (2) process 
group management software 10. 

Resources depended upon by platform layer process
groups 16 are initialized before initializing platform layer
process groups 16. Failure to bring a platform layer resource
in service results in the same fault recovery strategy as
previously explained for a fault in a resource depended upon
by a platform layer process group 16.

During initialization, platform layer process groups 16 
and application layer process groups 18, 20, and 22 are 
designated runable only when all platform layer resources
and platform layer process groups 16 are designated runable.
Platform layer process groups 16 handshake with the
process group manager 10 to control the startup sequence of
platform layer process groups 16. Similarly, application
layer process groups 18, 20, and 22 handshake with the
process group manager 10 to control the startup sequence of
application layer process groups 18, 20, and 22.

Application layer process groups 18, 20, and 22 can be put 
into the Off-line state 38 individually so that maintenance or 
software updates can be performed on the Off-line process
groups 18, 20, and 22 without impacting other process
groups on the computer 6 hosting the Off-line process
groups 18, 20, and 22.

Primary and alternate process groups 20 and 22 within the 
same process-group pair 24 may have a shared resource 
dependency. Process-group pairs 24 that have an active
cold-standby paired relationship typically provide high
availability. Process-group pairs 24 that have an active
hot-standby paired relationship typically provide very high
availability.

It will be obvious to those having ordinary skill in the art 
that primary and alternate process groups could be arranged
in a lead-active/active paired relationship, analogous to the
Active 36/Standby 34 relationship described above, without
departing from the scope of this invention. Such
lead-active/active process-group pairs typically provide
ultra-high availability.

Application layer resources on a single computing system 
are initialized before initializing primary or alternate process 
groups 20 or 22. Failure to bring a critical application layer
resource in service results in taking the dependent application
layer process group 18, 20, or 22 out of service by
transitioning it to the Unavailable state 30, and activating the
dependent process group's 20 or 22 paired process group 20
or 22, if one exists, on a separate computer 6.
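
The ordering described above (resources first, then the process group, with the paired process group taking over on failure) can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the names State, ProcessGroup, and bring_up are hypothetical, the enumeration values simply reuse the reference numerals of the states named in this description, and the sketch assumes that only a paired process group currently in Standby is activated.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterable, Optional


class State(Enum):
    UNAVAILABLE = 30
    STANDBY = 34
    ACTIVE = 36
    OFF_LINE = 38


@dataclass
class ProcessGroup:
    name: str
    state: State
    # Paired group on a separate computer, if any. Excluded from comparison
    # and repr so that mutually paired groups are not walked recursively.
    pair: Optional["ProcessGroup"] = field(default=None, compare=False, repr=False)


def bring_up(group: ProcessGroup, resource_inits: Iterable[Callable[[], bool]]) -> bool:
    """Initialize resources first; activate the group only if all succeed."""
    for init in resource_inits:
        if not init():
            # A critical resource failed: take the dependent group out of service.
            group.state = State.UNAVAILABLE
            # Assumption: the pair takes over only if it is in Standby.
            if group.pair is not None and group.pair.state is State.STANDBY:
                group.pair.state = State.ACTIVE
            return False
    group.state = State.ACTIVE
    return True
```

In the arrangement described here, activation of the paired process group would itself go through the handshake with the process group manager 10; the direct state assignment in the sketch stands in for that step.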

Activated platform layer and application layer process
groups 16, 18, 20, and 22 handshake with the process group
manager 10 to acknowledge activation to Active 36 or
Standby 34.

Process-group pairs 24 can be provided only when platform
layer process groups 16 are operating normally on the
computers 6 hosting both the primary process group 20 and
the alternate process group 22 that are to be contained in the
process-group pair 24.

Upon a primary or alternate process group's 20 or 22 
failure to acknowledge initiation of activation by the process 
group manager 10, the primary or alternate process group 20 
or 22 for which activation was initiated is transitioned to the 
Unavailable state 30. Similarly, upon an application layer 
primary or alternate process group's 20 or 22 failure to 
acknowledge initiation of de-activation from Active 36 to 
Standby 34, the primary or alternate process group 20 or 22 
for which de-activation was initiated is transitioned to the 
Unavailable state 30. 

Although initiated, activation of a standby process group 
20 or 22 does not occur until the Standby process group 20 
or 22 for which activation has been initiated acknowledges 
initiation of the activation by handshaking with the process 
group manager 10. 
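
The acknowledgement rule above can be sketched in Python as follows. The sketch is illustrative only and rests on assumptions: the names State and activate are hypothetical, the enumeration values reuse the reference numerals of the states named in this description, and a timeout is assumed as the way a failure to acknowledge is detected, which this description does not specify.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    UNAVAILABLE = 30
    STANDBY = 34
    ACTIVE = 36


def activate(current: State,
             wait_for_ack: Callable[[float], bool],
             timeout_s: float = 5.0) -> State:
    """Initiate activation of a Standby group and return its resulting state.

    `wait_for_ack(timeout_s)` is a caller-supplied (hypothetical) function
    that returns True if the group acknowledged within the timeout.
    """
    if current is not State.STANDBY:
        raise ValueError("only a Standby process group can be activated here")
    # Activation has been initiated, but the group is not Active until it
    # acknowledges by handshaking with the process group manager.
    if wait_for_ack(timeout_s):
        return State.ACTIVE
    # No acknowledgement: the group is transitioned to Unavailable.
    return State.UNAVAILABLE
```

The same pattern would apply to de-activation from Active 36 to Standby 34, where a missing acknowledgement likewise leads to the Unavailable state 30.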

Primary and alternate process groups 20 and 22 belonging 
to the same process-group pair 24 can be put in the Off-line 
state 38 for maintenance without impacting other process 
groups 18, 20, and 22, on any of the loosely coupled 
computers 6.

Separate application layer process groups 18, 20, and 22, 
running on the same computer 6, or on different computers 
6, can host dissimilar external applications, such as one or 
more application layer process groups 18, 20, or 22 con- 
trolling Code Division Multiple Access (CDMA) cellular 
telephone call processing, one or more process groups 18, 
20, or 22 controlling Time Division Multiple Access 
(TDMA) cellular telephone call processing, one or more 
process groups 18, 20, or 22 controlling Group Special 
Mobile (GSM) cellular telephone call processing, one or 
more process groups 18, 20, or 22 controlling Cellular 
Digital Packet Data (CDPD) cellular telephone call 
processing, and one or more process groups 18, 20, or 22 
controlling Analog Mobile Phone Service (AMPS) cellular 
telephone call processing. Similarly, separate application 
layer process groups 18, 20, and 22, running on the same 
computer 6, or on different computers 6, can host dissimilar 
external applications. 

As will be obvious to those having ordinary skill in the art, 
this invention provides the flexibility to configure one or 
more process-group pairs 24 across two or more loosely 
coupled computers 6 in several high availability computing 
element 6 configurations including active/standby, active/ 
active, and a typical n+k sparing arrangement. 

We claim: 

1. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 
e. re-booting said first computer upon a system layer
process group fault occurring on said first computer; 

f. providing at least one paired process group on a second
of said at least two computers, said paired process 
group being paired with one of said one or more 
process groups on said first computer; and 

g. activating said paired process group on said second 
computer upon a system layer process group fault 
occurring on said first computer. 

2. The method of claim 1 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said system layer
process group fault. 

3. The method of claim 2 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said system layer 
process group fault. 

4. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a fault in a resource depended upon 
by at least one of said system layer process groups on 
said first computer; 

e. re-booting said first computer upon said fault in said 
resource depended upon by at least one of said system 
layer process groups on said first computer; 

f. providing at least one paired process group on a second
of said at least two computers, said paired process 
group being paired with one of said process groups on 
said first computer; and 

g. activating said paired process group on said second 
computer upon said fault in said resource depended 
upon by at least one of said system layer process groups 
on said first computer. 

5. The method of claim 4 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said fault in said 
resource depended upon by at least one of said system layer 
process groups on said first computer. 

6. The method of claim 5 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by at least one of said system layer 
process groups on said first computer. 

7. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 
b. running, on at least one of said at least two computers,
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

e. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

f. providing, on at least said first computer, a platform
layer having at least one process group;

g. taking all of said process groups, except each of said at
least one process group in said system layer, out of
service on said first computer upon a platform layer
process group fault occurring on said first computer;

h. providing at least one paired process group on a second
of said at least two computers, said paired process
group being paired with one of said process groups on
said first computer; and

i. activating said paired process group on said second
computer upon said platform layer process group fault
occurring on said first computer.

8. The method of claim 7 further comprising, in
combination, the step of: re-initializing all of said process
groups, except each of said at least one process group in said
system layer, on said first computer upon said platform layer
process group fault occurring on said first computer.

9. The method of claim 8 further comprising, in
combination, the step of: re-booting said first computer upon
failure of said re-initialization to cure said platform layer
process group fault.

10. The method of claim 9 further comprising, in
combination, the step of: using an independent computer to
power cycle said first computer upon failure of said
re-booting of said first computer to cure said platform layer
process group fault.

11. The method of claim 10 further comprising, in
combination, the step of: using an independent computer to
power down said first computer upon failure of said power
cycling of said first computer to cure said platform layer
process group fault.

12. A method for providing high availability applications
comprising, in combination, the steps of:

a. running, on at least two computers, one or more process
groups, at least one of said process groups containing
one or more processes that have a fault recovery
strategy common to said at least one of said process
groups;

b. running, on at least one of said at least two computers,
a process group manager that initiates a fault recovery
strategy for at least one of said one or more process
groups;

c. providing, on at least a first of said at least two
computers, a system layer having at least one process
group;

d. taking all of said process groups out of service on said
first computer upon a system layer process group fault
occurring on said first computer;

e. re-booting said first computer upon a system layer
process group fault occurring on said first computer;

f. providing, on at least a first of said at least two
computers, a platform layer having at least one process
group;

g. taking all of said process groups, except each of said at
least one process group in said system layer, out of
service on said first computer upon a fault in a resource
depended upon by at least one of said platform layer
process groups on said first computer;

h. providing at least one paired process group on a second
of said at least two computers, said at least one paired
process group being paired with one of said process
groups on said first computer; and

i. activating said at least one paired process group on said
second computer upon said fault in said resource
depended upon by at least one of said platform layer
process groups on said first computer.

13. The method of claim 12 further comprising, in
combination, the steps of:

a. re-initializing said resource having said fault; and

b. re-initializing all of said process groups, except each of
said at least one process group in said system layer, on
said first computer upon said fault in said resource
depended upon by at least one of said platform layer
process groups on said first computer.

14. The method of claim 13 further comprising, in
combination, the step of: re-booting said first computer upon
failure of said re-initializations to cure said fault in said
resource depended upon by at least one of said platform
layer process groups on said first computer.

15. The method of claim 14 further comprising, in
combination, the step of: using an independent computer to
power cycle said first computer upon failure of said
re-booting of said first computer to cure said fault in said
resource depended upon by at least one of said platform
layer process groups on said first computer.

16. The method of claim 15 further comprising, in
combination, the step of: using an independent computer to
power down said first computer upon failure of said power
cycling of said first computer to cure said fault in said
resource depended upon by at least one of said platform
layer process groups on said first computer.

17. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, a system layer having at least one process 
group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

e. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

f. providing, on at least a first of said at least two 
computers, a platform layer having at least one process 
group; 

g. restarting at least one of said platform layer process 
groups upon a platform layer process group fault occur- 
ring on said first computer; 

h. taking all of said process groups, except each of said at 
least one process group in said system layer, out of 
service on said first computer upon failure of said
re-start to cure said platform layer process group fault;

i. providing at least one paired process group on a second
of said at least two computers, said at least one paired
process group being paired with one of said process
groups on said first computer; and

j. activating said at least one paired process group on said
second computer upon failure of said re-start to cure
said platform layer process group fault.

18. The method of claim 17 further comprising, in 
combination, the step of: re-initializing all of said process 
groups, except each of said at least one process group in said 
system layer, on said first computer upon failure of said 
re-start to cure said platform layer process group fault. 

19. The method of claim 18 further comprising, in 
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said platform layer 
process group fault. 

20. The method of claim 19 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said platform layer 
process group fault. 

21. The method of claim 20 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said platform layer 
process group fault. 

22. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least said first computer, a system layer 
having at least one process group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

e. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

f. providing, on at least a first of said at least two 
computers, an application layer having at least one 
process group; 

g. taking said at least one application layer process group
out of service on said first computer upon a fault in said 
at least one application layer process group on said first 
computer; 

h. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with one of said at least one 
application layer process group taken out of service on 
said first computer; 

i. activating said paired process group on said second 
computer upon said fault in said at least one application 
layer process group on said first computer; and 

j. re-initializing all of said process groups, except each of 
said at least one process group in said system layer, on 
said first computer upon not being able to take said 
application layer process group having said fault out of 
service on said first computer. 
23. The method of claim 22 further comprising, in
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said application layer 
process group fault. 

24. The method of claim 23 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said application layer process 
group fault. 

25. The method of claim 24 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said application layer 
process group fault. 

26. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, an application layer having at least two 
process groups; 

d. defining a dependency by at least a first of said at least 
two application layer process groups upon at least a 
second of said at least two application layer process 
groups; 

e. taking said first and said second application layer 
process groups out of service on said first computer 
upon a fault in said second application layer process 
group on said first computer; 

f. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with said second application layer 
process group on said first computer; and 

g. activating said paired process group on said second 
computer upon said fault in said second application 
layer process group on said first computer. 

27. The method of claim 26 further comprising, in 
combination, the steps of: 

a. providing, on at least said first computer, a system layer 
having at least one process group; 

b. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

c. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 
and 

d. re-initializing all of said process groups, except each of 
said at least one process group in said system layer, on 
said first computer upon not being able to take said first 
application layer process group out of service on said 
first computer. 

28. The method of claim 27 further comprising, in 
combination, the step of: re-booting said first computer upon
failure of said re-initialization to cure said fault in said
second application layer process group.

29. The method of claim 28 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said second
application layer process group. 

30. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing 
one or more processes that have a fault recovery 
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least a first of said at least two 
computers, an application layer having at least one 
process group; 

d. taking said at least one application layer process group
out of service on said first computer upon a fault in a 
resource depended upon by said at least one application 
layer process group; 

e. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with one of said at least one 
application layer process group on said first computer; 
and 

f. activating said paired process group on said second 
computer upon said fault in said resource depended 
upon by said at least one application layer process 
group. 

31. The method of claim 30 further comprising, in 
combination, the steps of: 

a. providing, on at least said first computer, a system layer 
having at least one process group; 

b. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer;

c. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

d. re-initializing said resource having said fault; and

e. re-initializing all of said process groups, except each of 
said at least one process group in said system layer, on 
said first computer upon not being able to take said at 
least one application layer process group out of service 
on said first computer. 

32. The method of claim 31 further comprising, in 
combination, the step of: re-booting said first computer upon 
failure of said re-initializations to cure said fault in said 
resource depended upon by said at least one application 
layer process group. 

33. The method of claim 32 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot
of said first computer to cure said fault in said resource 
depended upon by said at least one application layer process 
group. 

34. The method of claim 33 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by said at least one application 
layer process group. 

35. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process 
groups, at least one of said process groups containing
one or more processes that have a fault recovery
strategy common to said at least one of said process 
groups; 

b. running, on at least one of said at least two computers, 
a process group manager that initiates a fault recovery 
strategy for at least one of said one or more process 
groups; 

c. providing, on at least said first computer, a system layer 
having at least one process group; 

d. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

e. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 

f. providing, on at least a first of said at least two 
computers, an application layer having at least one 
process group; 

g. re-starting said at least one application layer process 
group on said first computer upon a fault in said at least 
one application layer process group; 

h. taking said at least one application layer process group 
out of service on said first computer upon failure of said 
re-start to cure said fault in said at least one application 
layer process group; 

i. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with one of said at least one 
application layer process group taken out of service on 
said first computer; 

j. activating said paired process group on said second 
computer upon failure of said re-start to cure said fault 
in said at least one application layer process group 
taken out of service on said first computer; and

k. re-initializing all of said process groups, except each of
said at least one process group in said system layer, on 
said first computer upon not being able to take said 
application layer process group having said fault out of 
service on said first computer. 

36. The method of claim 35 further comprising, in 
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said application layer 
process group fault. 

37. The method of claim 36 further comprising, in 
combination, the step of: using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said application layer process 
group fault. 

38. The method of claim 37 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said application layer 
process group fault. 

39. The method of claim 38 further comprising, in 
combination, the step of: using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
second application layer process group. 
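
By way of illustration only, and without limiting the claims, the escalating recovery sequence recited in claims 35 through 38 may be sketched as follows. The sketch is in Python; the function name `recover_application_group`, the `try_step` callback, and the step labels are hypothetical and do not appear in the specification. The callback is assumed to perform the named step and to return True when that step succeeds (for re-start, re-initialize, re-boot and power cycle, when the fault is cured).

```python
from typing import Callable, List

# Step labels are illustrative names, not terms defined by the claims.
ESCALATION = ["re-initialize", "re-boot", "power cycle"]


def recover_application_group(try_step: Callable[[str], bool]) -> List[str]:
    """Return the ordered list of recovery steps that were attempted.

    try_step(name) is a hypothetical callback that performs the named
    step and returns True if the step succeeded.
    """
    attempted = ["re-start"]
    if try_step("re-start"):
        return attempted  # fault cured by re-starting the group

    # Re-start did not cure the fault: the paired group on the second
    # computer is activated and the faulty group is taken out of service.
    attempted.append("activate pair")
    try_step("activate pair")
    attempted.append("take out of service")
    if try_step("take out of service"):
        return attempted

    # The group could not be taken out of service: escalate one step at
    # a time, stopping as soon as a step cures the fault.
    for name in ESCALATION:
        attempted.append(name)
        if try_step(name):
            return attempted

    # Last resort: an independent computer powers the first computer down.
    attempted.append("power down")
    try_step("power down")
    return attempted
```

Under these assumptions, a caller might supply a table of simulated outcomes, for example `recover_application_group(lambda name: outcomes.get(name, False))`, to see which steps would be reached for a given fault.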

40. A method for providing high availability applications 
comprising, in combination, the steps of: 

a. running, on at least two computers, one or more process
groups, at least one of said process groups containing
one or more processes that have a fault recovery
strategy common to said at least one of said process
groups;

b. running, on at least one of said at least two computers,
a process group manager that initiates a fault recovery
strategy for at least one of said one or more process
groups;

c. providing, on at least a first of said at least two
computers, an application layer having at least two 
process groups; 

d. defining a dependency by at least a first of said at least 
two application layer process groups upon at least a
second of said at least two application layer process 
groups; 

e. re-starting said second application layer process group 
on said first computer upon a fault in said second 
application layer process group; 

f. taking said first and said second application layer 
process groups out of service on said first computer 
upon failure of said re-start to cure said fault in said
second application layer process group;

g. providing at least one paired process group on a second 
of said at least two computers, said paired process 
group being paired with said first application layer 
process group on said first computer; and 

h. activating said at least one paired process group on said
second computer upon failure of said re-start to cure 
said fault in said second application layer process 
group. 

41. The method of claim 40 further comprising, in 
combination, the steps of:

a. providing, on at least said first computer, a system layer 
having at least one process group; 

b. taking all of said process groups out of service on said 
first computer upon a system layer process group fault 
occurring on said first computer; 

c. re-booting said first computer upon a system layer 
process group fault occurring on said first computer; 
and 

d. re-initializing all of said process groups, except each of
said at least one process group in said system layer, on 
said first computer upon not being able to take said first 
application layer process group out of service on said 
first computer. 

42. The method of claim 41 further comprising, in
combination, the step of: re-booting said first computer upon 
failure of said re-initialization to cure said fault in said 
second application layer process group. 

43. The method of claim 42 further comprising, in 
combination, the step of: using an independent computer to
power cycle said first computer upon failure of said re-boot
of said first computer to cure said fault in said second
application layer process group. 

44. The method of claim 43 further comprising, in 
combination, the step of: using an independent computer to
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
second application layer process group. 
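
By way of illustration only, and without limiting the claims, the dependency handling recited in claim 40 may be sketched as follows: when a re-start fails to cure a fault in one application layer process group, that group and every group depending upon it are taken out of service, and any paired groups on the second computer are activated. The sketch is in Python; the function names, the mapping layout, and the group names used in the usage note are hypothetical. Following dependencies transitively is an assumption of the sketch; claim 40 recites only a first group depending upon a second.

```python
from typing import Dict, List, Set, Tuple


def dependents_of(faulty: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Groups that depend, directly or transitively, on `faulty`.

    depends_on maps a group name to the set of groups it depends upon.
    Transitive closure is an assumption of this sketch.
    """
    affected: Set[str] = set()
    frontier = [faulty]
    while frontier:
        current = frontier.pop()
        for group, deps in depends_on.items():
            if current in deps and group not in affected:
                affected.add(group)
                frontier.append(group)
    return affected


def plan_after_failed_restart(
    faulty: str,
    depends_on: Dict[str, Set[str]],
    paired_on_second: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Return (groups to take out of service, paired groups to activate)."""
    out_of_service = sorted({faulty} | dependents_of(faulty, depends_on))
    # Only groups that actually have a pair on the second computer
    # contribute an activation.
    activate = [paired_on_second[g] for g in out_of_service if g in paired_on_second]
    return out_of_service, activate
```

For example, with a hypothetical group "billing" that depends upon "database" and is paired with "billing-standby", calling `plan_after_failed_restart("database", {"billing": {"database"}}, {"billing": "billing-standby"})` takes both groups out of service and activates the standby.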

45. An apparatus for providing high availability applications
comprising, in combination:

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups;

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least
two computers, a system layer having at least one
process group;

d. means for taking all of said process groups out of
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system
layer process group fault occurring on said first com- 
puter; 

f. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with one of said one or 
more process groups on said first computer; and 

g. means for activating said paired process group on said 
second computer upon a system layer process group 
fault occurring on said first computer. 

46. The apparatus of claim 45 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said system layer 
process group fault. 

47. The apparatus of claim 46 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said system layer 
process group fault. 

48. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a fault in a resource 
depended upon by at least one of said system layer 
process groups on said first computer; 

e. means for re-booting said first computer upon said fault 
in said resource depended upon by at least one of said 
system layer process groups on said first computer;

f. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with one of said process 
groups on said first computer; and 

g. means for activating said paired process group on said 
second computer upon said fault in said resource 
depended upon by at least one of said system layer 
process groups on said first computer. 

49. The apparatus of claim 48 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said fault in said
resource depended upon by at least one of said system layer 
process groups on said first computer. 

50. The apparatus of claim 49 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by at least one of said system layer 
process groups on said first computer. 

51. An apparatus for providing high availability applications
comprising, in combination:

a. means for running, on at least two computers, one or
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least said first computer, a 
platform layer having at least one process group; 

g. means for taking all of said process groups, except each 
of said at least one process group in said system layer, 
out of service on said first computer upon a platform 
layer process group fault occurring on said first com- 
puter; 

h. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with one of said process 
groups on said first computer; and 

i. means for activating said paired process group on said 
second computer upon said platform layer process 
group fault occurring on said first computer. 

52. The apparatus of claim 51 further comprising, in 
combination: means for re-initializing all of said process
groups, except each of said at least one process group in said 
system layer, on said first computer upon said platform layer 
process group fault occurring on said first computer. 

53. The apparatus of claim 52 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said platform layer 
process group fault. 

54. The apparatus of claim 53 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said platform layer 
process group fault. 

55. The apparatus of claim 54 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said platform layer 
process group fault. 

56. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group;

d. means for taking all of said process groups out of
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least a first of said at least 
two computers, a platform layer having at least one 
process group; 

g. means for taking all of said process groups, except each 
of said at least one process group in said system layer, 
out of service on said first computer upon a fault in a 
resource depended upon by at least one of said platform 
layer process groups on said first computer; 

h. means for providing at least one paired process group 
on a second of said at least two computers, said at least 
one paired process group being paired with one of said 
process groups on said first computer; and 

i. means for activating said at least one paired process 
group on said second computer upon said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

57. The apparatus of claim 56 further comprising, in 
combination: 

a. means for re-initializing said resource having said fault;
and

b. means for re-initializing all of said process groups,
except each of said at least one process group in said 
system layer, on said first computer upon said fault in 
said resource depended upon by at least one of said 
platform layer process groups on said first computer. 

58. The apparatus of claim 57 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initializations to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

59. The apparatus of claim 58 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

60. The apparatus of claim 59 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by at least one of said platform 
layer process groups on said first computer. 

61. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, a system layer having at least one 
process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer;

e. means for re-booting said first computer upon a system
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least a first of said at least 
two computers, a platform layer having at least one
process group; 

g. means for restarting at least one of said platform layer 
process groups upon a platform layer process group 
fault occurring on said first computer; 

h. means for taking all of said process groups, except each 
of said at least one process group in said system layer, 
out of service on said first computer upon failure of said 
re-start to cure said platform layer process group fault; 

i. means for providing at least one paired process group on
a second of said at least two computers, said at least one 
paired process group being paired with one of said 
process groups on said first computer; and 

j. means for activating said at least one paired process 
group on said second computer upon failure of said
re-start to cure said platform layer process group fault. 

62. The apparatus of claim 61 further comprising, in 
combination: means for re-initializing all of said process 
groups, except each of said at least one process group in said 
system layer, on said first computer upon failure of said
re-start to cure said platform layer process group fault. 

63. The apparatus of claim 62 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said platform layer 
process group fault.

64. The apparatus of claim 63 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said 
re-booting of said first computer to cure said platform layer 
process group fault.

65. The apparatus of claim 64 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said platform layer 
process group fault.

66. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or
more process groups; 

c. means for providing, on at least said first computer, a 
system layer having at least one process group; 

d. means for taking all of said process groups out of
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first
computer;

f. means for providing, on at least a first of said at least 
two computers, an application layer having at least one 
process group; 

g. means for taking said at least one application layer 
process group out of service on said first computer
upon a fault in said at least one application layer
process group on said first computer;

h. means for providing at least one paired process group
on a second of said at least two computers, said paired 
process group being paired with one of said at least one 
application layer process group taken out of service on 
said first computer; 

i. means for activating said paired process group on said 
second computer upon said fault in said at least one 
application layer process group on said first computer; 
and 

j. means for re-initializing all of said process groups, 
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said application layer process group having said 
fault out of service on said first computer. 

67. The apparatus of claim 66 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said application layer 
process group fault. 

68. The apparatus of claim 67 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said application layer process 
group fault. 

69. The apparatus of claim 68 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said application layer 
process group fault. 

70. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, an application layer having at least two 
process groups; 

d. means for defining a dependency by at least a first of 
said at least two application layer process groups upon 
at least a second of said at least two application layer 
process groups; 

e. means for taking said first and said second application 
layer process groups out of service on said first com- 
puter upon a fault in said second application layer 
process group on said first computer; 

f. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with said second applica- 
tion layer process group on said first computer; and 

g. means for activating said paired process group on said 
second computer upon said fault in said second appli- 
cation layer process group on said first computer. 

71. The apparatus of claim 70 further comprising, in 
combination: 

a. means for providing, on at least said first computer, a 
system layer having at least one process group; 

b. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

c. means for re-booting said first computer upon a system 
layer process group fault occurring on said first
computer; and

d. means for re-initializing all of said process groups,
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said first application layer process group out of 
service on said first computer. 

72. The apparatus of claim 71 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said fault in said 
second application layer process group. 

73. The apparatus of claim 72 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said second 
application layer process group. 

74. The apparatus of claim 73 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
second application layer process group. 

75. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least a first of said at least 
two computers, an application layer having at least one 
process group; 

d. means for taking said at least one application layer 
process group out of service on said first computer 
upon a fault in a resource depended upon by said at 
least one application layer process group; 

e. means for providing at least one paired process group 
on a second of said at least two computers, said paired 
process group being paired with one of said at least one 
application layer process group on said first computer; 
and 

f. means for activating said paired process group on said 
second computer upon said fault in said resource 
depended upon by said at least one application layer 
process group. 

76. The apparatus of claim 75 further comprising, in 
combination: 

a. means for providing, on at least said first computer, a 
system layer having at least one process group; 

b. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

c. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

d. means for re-initializing said resource having said fault;
and

e. means for re-initializing all of said process groups,
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said at least one application layer process group 
out of service on said first computer.

77. The apparatus of claim 76 further comprising, in 
combination: means for re-booting said first computer upon
failure of said re-initializations to cure said fault in said
resource depended upon by said at least one application 
layer process group. 

78. The apparatus of claim 77 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said fault in said resource 
depended upon by said at least one application layer process 
group. 

79. The apparatus of claim 78 further comprising, in 
combination: means for using an independent computer to 
power down said first computer upon failure of said power 
cycling of said first computer to cure said fault in said 
resource depended upon by said at least one application 
layer process group. 

80. An apparatus for providing high availability applica- 
tions comprising, in combination: 

a. means for running, on at least two computers, one or 
more process groups, at least one of said process groups 
containing one or more processes that have a fault 
recovery strategy common to said at least one of said 
process groups; 

b. means for running, on at least one of said at least two 
computers, a process group manager that initiates a 
fault recovery strategy for at least one of said one or 
more process groups; 

c. means for providing, on at least said first computer, a 
system layer having at least one process group; 

d. means for taking all of said process groups out of 
service on said first computer upon a system layer 
process group fault occurring on said first computer; 

e. means for re-booting said first computer upon a system 
layer process group fault occurring on said first com- 
puter; 

f. means for providing, on at least a first of said at least 
two computers, an application layer having at least one 
process group; 

g. means for re-starting said at least one application layer 
process group on said first computer upon a fault in said 
at least one application layer process group; 

h. means for taking said at least one application layer 
process group out of service on said first computer 
upon failure of said re-start to cure said fault in said at 
least one application layer process group; 

i. means for providing at least one paired process group on 
a second of said at least two computers, said paired 
process group being paired with one of said at least one 
application layer process group taken out of service on 
said first computer; 

j. means for activating said paired process group on said 
second computer upon failure of said re-start to cure 
said fault in said at least one application layer process 
group taken out of service on said first computer; and 

k. means for re-initializing all of said process groups, 
except each of said at least one process group in said 
system layer, on said first computer upon not being able 
to take said application layer process group having said 
fault out of service on said first computer. 

81. The apparatus of claim 80 further comprising, in 
combination: means for re-booting said first computer upon 
failure of said re-initialization to cure said application layer 
process group fault. 

82. The apparatus of claim 81 further comprising, in 
combination: means for using an independent computer to 
power cycle said first computer upon failure of said re-boot 
of said first computer to cure said application layer process 
group fault.

83. The apparatus of claim 82 further comprising, in
combination: means for using an independent computer to
power down said first computer upon failure of said power
cycling of said first computer to cure said application layer
process group fault.

84. A method for providing high availability applications
comprising, in combination:

a. means for running, on at least two computers, one or
more process groups, at least one of said process groups
containing one or more processes that have a fault
recovery strategy common to said at least one of said
process groups;

b. means for running, on at least one of said at least two
computers, a process group manager that initiates a
fault recovery strategy for at least one of said one or
more process groups;

c. means for providing, on at least a first of said at least
two computers, an application layer having at least two
process groups;

d. means for defining a dependency by at least a first of
said at least two application layer process groups upon
at least a second of said at least two application layer
process groups;

e. means for re-starting said second application layer
process group on said first computer upon a fault in said
second application layer process group;

f. means for taking said first and said second application
layer process groups out of service on said first
computer upon failure of said re-start to cure said fault in
said second application layer process group;

g. means for providing at least one paired process group
on a second of said at least two computers, said paired
process group being paired with said first application
layer process group on said first computer; and

h. means for activating said at least one paired process
group on said second computer upon failure of said
re-start to cure said fault in said second application
layer process group.

85. The apparatus of claim 84 further comprising, in
combination:

a. means for providing, on at least said first computer, a
system layer having at least one process group;

b. means for taking all of said process groups out of
service on said first computer upon a system layer
process group fault occurring on said first computer;

c. means for re-booting said first computer upon a system
layer process group fault occurring on said first
computer; and

d. means for re-initializing all of said process groups,
except each of said at least one process group in said
system layer, on said first computer upon not being able
to take said first application layer process group out of
service on said first computer.

86. The apparatus of claim 85 further comprising, in
combination: means for re-booting said first computer upon
failure of said re-initialization to cure said fault in said
second application layer process group.

87. The apparatus of claim 86 further comprising, in
combination: means for using an independent computer to
power cycle said first computer upon failure of said re-boot
of said first computer to cure said fault in said second
application layer process group.

88. The apparatus of claim 87 further comprising, in
combination: means for using an independent computer to
power down said first computer upon failure of said power
cycling of said first computer to cure said fault in said
second application layer process group.